MsftBigData

by George Taniwaki

Big data and machine learning are all the rage now. Articles in the popular press inform us that anyone who can master the skills needed to turn giant piles of previously unexplored data into golden nuggets of business insight can write their own ticket to a fun and remunerative career (efinancialcareers May 2017).

Conversely, the press also tells us that if we don’t learn these skills a computer will take our job (USA Today Mar 2014). I will have a lot more to say about changes in employment and income during the industrial revolution in future blog posts.

But how do you learn to become a data scientist? And which software stack should you specialize in? There are many tools to choose from. Since I live in the Seattle area and do a lot of work for Microsoft, I decided to take an online program developed and sponsored by Microsoft and edX. Completion of the program leads to a Microsoft Data Science Certificate.

The program consists of 10 courses with some choices, like conducting analysis using either Excel or Power BI, and programming using either R or Python. Other parts of the Microsoft stack you will learn include SQL Server for queries and Microsoft Azure Machine Learning (MAML) for analysis and visualization of results. The courses are priced at about $99 each. You can audit them for free if you don’t care about the certificates.

I started the program in February and am about halfway done. In case any clients or potential employers are interested in my credentials, my progress is shown below.

DAT101x – Data Science Orientation

If you haven’t been in college in a while or have never taken an online class, this is a good introduction to online learning. The homework consists of some simple statistics and visualization problems.

Time: 3 hours for 3 modules

Score: 100% on 3 assignments

DAT101x Score    DAT101x Certificate

DAT201x – Querying with Transact-SQL

I took a T-SQL class online at Bellevue College two years ago. Taking a class with a real teacher, even one you never meet, was a significantly better experience than a self-paced MOOC. This course starts with the basics like SELECT statements, subqueries, and variables. It also covers intermediate topics like programming, expressions, stored procedures, and error handling. I did my homework using both a local instance of SQL Server and an Azure SQL database.

Time: 20 hours for 11 modules

Score: I missed one question in the homework and two in the final exam for a combined score of 94%

DAT201x Score     DAT201x Certificate

DAT207x – Analyzing and Visualizing Data with Power BI

I already have experience creating reports using Power BI. I also use Power Query (now called Get & Transform Data) with the M language, and Power Pivot with the DAX language, so this was an easy class.

The course covers data transforms, modeling, visualization, Power BI web service, organization packs, security and groups. It also touches on the developer API and building mobile apps.

Time: 12 hours for 9 modules

Score: I missed one lab question for a combined score of 98%

DAT207x Score     DAT207x Certificate

DAT222x – Essential Statistics for Data Analysis using Excel

This class is comprehensive and covers all the standard statistics and probability topics, including descriptive statistics, Bayes’ rule, random variables, the central limit theorem, sampling and confidence intervals, and hypothesis testing. Most analysis is conducted using the Analysis ToolPak add-in for Excel.

Time: I used to work in market research, so I know my statistics. However, there are 36 homework assignments and it took me over 20 hours to complete the 5 modules.

Score: I missed nine questions on the quizzes (88%) and six on the final exam (81%) for a combined score of 86%. (Despite the time it takes to complete, the homework counts very little toward the final grade.)

DAT222x Score     DAT222x Certificate

DAT204x – Introduction to R for Data Science

Now we are getting into the meat of the program. R is a functional language. In many ways it is similar to the M language used in Power Query. I was able to quickly learn the syntax and grasp the core concepts.

The course covers vectors, matrices, factors, lists, data frames, and simple graphics.

The lab assignments use DataCamp which has a script window where you write code and a console window that displays results. That makes it easy to debug programs as you write them.

Time: 15 hours for 7 modules

Score: I got all the labs right and missed two questions on the quizzes. The final exam used an unexpected format. It was timed and consisted mostly of fill-in-the-blank responses. You are given 4 minutes per question; if you don’t answer within the time limit, the exam moves on to the next question. I completed the test in about 70 minutes and was exhausted at the end. I only got 74% on the final (I don’t know how many questions I missed or how many there were), for a combined score of 88%.

DAT204x Score     DAT204x Certificate

DAT203.1x – Data Science Essentials

The first three modules in this course covered statistics and were mostly a repeat of the material introduced in DAT222x. But the rest of the course provides an excellent introduction to machine learning. You learn how to create a MAML instance, import a SQL query, manipulate the data using R or Python, create a model, score it, publish it as a web service, and use the web service to append predictions as a column in Excel. I really like MAML. I will post a review of my experience in a future blog post.

The course was a little too cookbook-like for my taste. It consisted mostly of following directions to drag-drop boxes onto the canvas UI and copy-paste code snippets into the panels. However, if you want a quick introduction to machine learning without having to dig into the details of SQL, R, or Python, this is a great course.

Time: 10 hours for 6 modules

Score: 100% on the 6 labs and the final

DAT203.1x Score     DAT203.1x Certificate

I have now completed six of the ten courses required for the certificate. I expect to finish the remaining four by the end of the year. I will also probably take some of the other elective courses simply to learn more about Microsoft’s other machine learning and cloud services.

For my results in the remaining classes, see Microsoft Data Science Certificate – Part 2.

by George Taniwaki

Did you watch the debate on Monday night? I did. But I am also very interested in the post-debate media coverage and analysis. This morning, two articles that combine big data and the debate caught my eye. Both are novel and much more interesting than the tired stories that simply show changes in polls after a debate.

First, the New York Times reports that during the presidential debate (between 9:00 and 10:30 PM EDT) there was a high correlation between the Betfair prediction market for who will win the presidential election and after-hours S&P 500 futures prices (see chart 1).

PresidentSandP500

Chart 1. Betfair prediction market for Mrs. Clinton compared to S&P 500 futures. Courtesy of New York Times

Correlation between markets is not a new phenomenon. For several decades financial analysts have measured the covariance between commodity prices, especially crude oil, and equity indices. But this is the first time I have seen an article illustrating the covariance between a “fun” market for guessing who will become president and a “real” market. Check out the two graphs above; the similarity in shape is striking, including the fact that both continue to rise for about an hour after the debate ended.

In real time, while the debate was being broadcast, players on Betfair bid up the chance that Mrs. Clinton would win the election by 5 percentage points. Meanwhile, the price of S&P 500 futures rose by 0.6%, meaning investors (who may be the same speculators who play on Betfair) believed stock market prices in November were likely to be higher than they had expected before the debate started. There was no other surprising economic news that evening, so the debate is the most likely explanation for the surge. Pretty cool.

If the two markets are perfectly correlated (they aren’t) and markets are perfectly efficient (they aren’t), then one can estimate the difference in equity futures market value between the two candidates. If a 5% decrease in the likelihood of a Trump win translates into a 0.6% increase in equity futures values, then the difference between Mr. Trump and Mrs. Clinton being elected (a 100% change in probability) results in about a 12%, or $1.2 trillion, change in market value (the total market cap of the S&P 500 is about $10 trillion). (Note that I assume perfect correlation between the S&P 500 futures market and the actual market for the stocks used to calculate the index.)

Further, nearly all capital assets (stocks, bonds, commodities, real estate) in the US are now highly correlated. So the total difference is about $24 trillion (assuming total assets in the US are $200 trillion). Ironically, this probably means Donald Trump would be financially better off if he were to lose the election.
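Here is a small R sketch of that back-of-the-envelope arithmetic, using the rounded figures quoted above (the 5-point probability move, the 0.6% futures move, the $10 trillion S&P 500 market cap, and the assumed $200 trillion in total US capital assets):

```r
# Back-of-the-envelope estimate using the rounded figures quoted above
prob_change    <- 0.05    # ~5 point rise in the implied probability of a Clinton win
futures_change <- 0.006   # ~0.6% rise in S&P 500 futures over the same window
sp500_cap      <- 10e12   # approximate total market cap of the S&P 500, in dollars
us_assets      <- 200e12  # assumed total value of US capital assets, in dollars

# Scale the observed move up to a 100% swing in win probability
swing_pct <- futures_change / prob_change   # 0.12, i.e. about 12%

swing_pct * sp500_cap   # about $1.2 trillion for the S&P 500
swing_pct * us_assets   # about $24 trillion for all US capital assets
```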

****

The other article that caught my eye involves Google Trends data. According to the Washington Post, the phrase “registrarse para votar” was the third highest trending search term the day after the debate was broadcast. The number of searches was about four times higher than in the days prior to the debate (see chart 2). Notice that the spike in searches matches a spike in Sep 2012 after the first Obama-Romney debate.

The article says that it is not clear if it was the debate itself that caused the increase or the fact that Google recently introduced Spanish-language voting guides to its automated Knowledge Box, which presumably led to more searches for “registrarse para votar”. (This is the problem with confounding events.)

After a bit of research, I discovered an even more interesting fact. The spike in searches did not stop on Sep 27. Today, Sep 30, four days after the debate, the volume of searches is 10 times higher than on Sep 27, or 40 times higher than before the debate (see chart 3). The two charts are scaled to make the data comparable.

VotarWashPost

Chart 2. Searches for “registrarse para votar” past 5 years to Sep 27. Courtesy of Washington Post and Google Trends

VotarToday

Chart 3. Searches for “registrarse para votar” past 5 years to Sep 30. Courtesy of Google Trends

I wanted to see if the spike was due to the debate or to the addition of Spanish voter information to the Knowledge Box. To do this, I compared “registrarse para votar” to “register to vote”. The red line in chart 4 shows Google Trends data for “register to vote”, scaled so that the bump in Sep 2012 is the same height as in the charts above. I’d say the debate really had an unprecedented effect on interest in voting, and the effect was probably bigger for Spanish-speaking web users.

VoteToday

Chart 4. Searches for “register to vote” past 5 years to Sep 30. Courtesy of Google Trends
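The rescaling I described above is easy to reproduce. Here is a minimal R sketch of the idea; the weekly interest values below are made up for illustration, not actual Google Trends data.

```r
# Rescale one search series so its Sep 2012 peak matches the other series' peak.
# These vectors are made-up weekly interest values, not actual Google Trends data.
trends_es <- c(2, 3, 10, 4, 3)      # hypothetical "registrarse para votar" series, peak 10
trends_en <- c(20, 25, 80, 30, 22)  # hypothetical "register to vote" series, peak 80

scale_factor <- max(trends_es) / max(trends_en)
trends_en_scaled <- trends_en * scale_factor  # both series now peak at the same height
```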

Finally, I wanted to see how the search requests were distributed geographically. The key here is that most Hispanic communities vote Democratic and many states with a large Hispanic population are already blue (such as California, Washington, New Mexico, New Jersey, and New York). The exception is Florida with a large population of Cuban immigrants who tend to vote Republican.

VotarRegionToday

Chart 5. Searches for “registrarse para votar” past 5 years to Sep 30 by county. Courtesy of Google Trends

If you are a supporter of Democrats like Mrs. Clinton, the good news is that a large number of queries are coming from Arizona and Texas, two states where changes in demographics are slowly turning voting preferences from red to blue.

In Florida, it is not clear which candidate gains from an increased number of Spanish-speaking voters. However, since the increase is a result of the debate (during which it was revealed that Mr. Trump had insulted and berated a beauty pageant winner from Venezuela, calling her “Miss Housekeeping”), I will speculate that many newly registered voters are going to be Clinton supporters.

If the Google search trend continues, it may be driven by news reports that Mr. Trump may have violated US sanctions forbidding business transactions in Cuba. Cuban-Americans searching for information on voter registration after hearing this story are more likely to favor Mrs. Clinton.

by George Taniwaki

Often, one has lots of data to display that are grouped in pairs, such as men vs. women. Further, we want to show the pairs but not emphasize comparisons between them. Instead, we are more interested in comparing different groups within one member of a pair than comparing the two members of a pair within a group. For instance, our groups could be age ranges, and we are more interested in comparing 20-24 women to 25-29 women than comparing 20-24 women to 20-24 men.

The diverging stacked bar chart is a very good way to display a pair of values next to each other. To allow easier comparison of the lengths of adjacent bars, there is usually no white space between them. To allow easier comparison between the left and right bars, they share a common baseline with no space between them but are drawn in different colors. The values for the bars running to the left are not negative; they are positive, just like the values on the right. The left and right bars are paired and measure the values for two different but related groups.

The most common use of the diverging stacked bar chart is to display age distribution of a population broken down by gender. This type of chart is often called a population pyramid.  An example of a population pyramid using data from the 2000 U.S. Census is shown in Figure 1.

PopulationPyramidUS2000

Figure 1. Population pyramid for the U.S. based on Census 2000 data. Image from Censusscope.

The population pyramid is a special case of the diverging stacked bar chart. Notice that each of the horizontal bars is the same width and covers the same age range (except the oldest group). Thus, the height of each bar represents the same number of years and the stack of bars forms a vertical axis showing age. Similarly, the area of each bar represents the proportion of the population in that age group, and the area of all the bars shows the total size of the population. A well-drawn population pyramid shows three dimensions at once: age, gender, and counts.
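If you want to experiment with the format, here is a minimal base-R sketch of the mirrored-bar trick behind a population pyramid. The age groups and counts below are made up for illustration; they are not the Census 2000 data shown in Figure 1.

```r
# Made-up counts in millions, for illustration only
ages   <- c("0-4", "5-9", "10-14", "15-19", "20-24", "25-29")
male   <- c(9.8, 10.2, 10.4, 10.1, 9.7, 9.6)
female <- c(9.4, 9.7, 9.9, 9.8, 9.5, 9.6)

# Negate the male counts so their bars run to the left; the data stay positive,
# only the drawing is mirrored. A polished chart would relabel the x axis with
# absolute values.
barplot(-male, horiz = TRUE, names.arg = ages, las = 1,
        xlim = c(-12, 12), col = "steelblue", xlab = "Population (millions)")
barplot(female, horiz = TRUE, add = TRUE, axes = FALSE, col = "salmon")
```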

The shape of a population pyramid tells a lot about the population growth of a group, which itself is a result of economic and political conditions that affect fertility, infant survival, immigration and emigration, and longevity. Figure 2 shows the four commonly seen shapes for a population pyramid.

The two triangles at the right (labeled stage 1 and stage 2) describe a group where a combination of high birthrates, high emigration, and high mortality causes the number of young to greatly exceed the old. Several countries in Sub-Saharan Africa and India have population pyramids of this shape.

The flatter shape (labeled stage 3) describes a group where births, immigration/emigration, and mortality are in balance. Most of the developing countries and the U.S. have population pyramids of this shape.

Finally, the egg-shaped pyramid (labeled stage 4) has a base that is smaller than the center. This describes a group where a combination of a low birth rate, a high immigration rate, and low mortality causes a bulge in the middle. If the fertility rate is below the replacement rate (about 2.1 children per female lifetime), then the population is growing older and may even be shrinking. Nearly all of the developed countries and China have population pyramids of this shape.

PopulationPyramid

Figure 2. Four commonly seen shapes for population pyramid. Image from Wikipedia

If you would like to explore population pyramids on a national, state, and metro area basis, go to http://www.censusscope.org/us/chart_age.html

****

Diverging stacked bar charts can be used in cases where there are more than two categories. In a paper presented at the 2011 Joint Statistical Meeting, Naomi Robbins and Richard Heiberger suggest that Likert scale data should be presented using this method. If the questionnaire uses the standard 5-point scale, they argue that the “Strongly disagree,” “Disagree,” and half the “Neither agree nor disagree” counts should be shown on the left bar. The counts for “Strongly agree,” “Agree,” and half of “Neither agree nor disagree” should be shown on the right bar. An example is shown in Figure 3.

image

Figure 3. Diverging stacked bar chart used to display Likert scale data. Image from 2011 JSM

I’ve tried a bunch of different ways of presenting Likert scale data (as well as other scaled data for importance, satisfaction, and other opinions) and have never been happy with my efforts. I really like this technique. If you review the paper, you will see eight common methods for displaying Likert scale data that the authors label as “Not recommended.” I’ve used many of them.

For instance, I’ve used the standard colored bar chart like the one shown in Figure 4. The problem is that every bar is the same length, so the ends of the bars, which your eyes are drawn to, don’t convey any data. All of the data is conveyed at the interior break points of the bars. By comparison, in Figure 3, the data is conveyed by the length of each bar and the proportion of each bar that is filled with the darker shade of color.

image

Figure 4. Standard bar chart used to display Likert scale data. Image from 2011 JSM

So how do you create your own diverging stacked bar charts? If you are an R user, you can use functions available in the HH and latticeExtra packages. These functions are also available through the RExcel add-in for Excel on Windows.
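As a rough illustration, a call to the HH package might look like the sketch below. I have not verified the exact interface of likert() here, so treat the data frame layout (one row per question, one ordered column per response level) and the call itself as assumptions and check the package documentation before relying on it.

```r
# Sketch only: assumes HH::likert() accepts a data frame of counts laid out with
# one row per question and one ordered column per response level
library(HH)

responses <- data.frame(
  "Strongly disagree" = c(5, 12),
  "Disagree"          = c(10, 18),
  "Neither"           = c(20, 25),
  "Agree"             = c(40, 30),
  "Strongly agree"    = c(25, 15),
  row.names   = c("Question 1", "Question 2"),
  check.names = FALSE
)

# The function splits the middle category across the zero line, which is the
# Robbins and Heiberger recommendation described above
likert(responses)
```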

If you are not an R user, you can create diverging stacked bar charts manually using Excel or Tableau. For instructions using Excel, Amy Emery has a good tutorial on Slideshare.net.

Incidentally, if you have time, check out some of Ms. Emery’s other slide shows. They are quite good and cover a range of topics. There is even one to help R novices like me get started in learning the language.

Many thanks to my friend and colleague Carol Borthwick for pointing me to this new use for the diverging stacked bar chart.

by George Taniwaki

This week’s issue of the New Engl J Med (subscription required) should be of special interest to those who follow kidney disease. The issue contains several articles on medical investigations into treatments and risk factors for kidney disease, along with related editorials. Unfortunately, most of the news is not good.

NewEnglJMed

However, there is an important lesson to gain from these studies. Scientific knowledge advances in two ways. First is the knowledge gained by learning what works. There is the obvious clinical benefit of knowing the best treatment for a patient. But successful studies also point the direction for other researchers, showing where they can expect the greatest promise for future investigation.

Second, failures are also valuable learning experiences. Knowing what doesn’t work reduces the chance that doctors or patients will try the same therapy on their own. But an unsuccessful trial does not mean a line of research should be abandoned. Rather, a failure should teach us to look at root causes.

Every experiment or medical trial is expected to be successful (otherwise you should invest your time and effort in a different project with a greater potential payoff). When it isn’t, we are temporarily surprised. But that should lead to a new investigation as to why the trial did not work as intended. And that investigation will hopefully lead to new insights that can be added to the body of human knowledge.

Trial of ACE inhibitors and ARBs

In the first article, Linda Fried of the Univ. Pittsburgh School of Medicine and her coauthors examine the effect of angiotensin-converting enzyme (ACE) inhibitors and angiotensin-receptor blockers (ARBs) on patients with type 2 diabetes who also have kidney disease. These two classes of drugs are often prescribed for the treatment of hypertension and congestive heart failure.

For kidney patients, the use of these drugs was intended to slow the decline in glomerular filtration rate (GFR). Previous studies had shown that ACE inhibitors and ARBs could benefit patients who already showed signs of proteinuria (protein in the urine, a sign of kidney disease). The goal of this study was to see if prescribing ACE inhibitors and ARBs to kidney patients earlier could forestall the progress toward end-stage renal disease (ESRD).

Since the progress of kidney disease for a particular patient is uncertain and can take many years, this study required a large sample that would be willing to participate by taking a prescription drug (or a placebo) for a multiyear period.

As reported in the article, the trial was stopped after four years because of safety concerns. There were more adverse events in the therapy group than in the placebo group. The most common problem was acute kidney injury, with the next most common being hyperkalemia (high potassium levels in the blood, which if untreated can cause an irregular heartbeat). Because the study was stopped early, we now know that the combination therapy of ACE inhibitors and ARBs can cause injury, but we don’t know if it can delay the onset of ESRD.

Trial of bardoxolone methyl

The next article (online first), by Dick de Zeeuw of the Univ. Groningen and coauthors, summarizes the results of treating patients who have both type 2 diabetes and stage 4 kidney disease with bardoxolone methyl. This drug is an antioxidant that can be taken orally and has been shown to reduce serum creatinine.

Similar to the ACE inhibitor and ARB study, the sample size was large and the trial was intended to span several years. However, like the other study, it was ended early due to safety concerns. Those in the therapy group had more adverse events than those in the placebo group. Those who received the treatment had a significantly higher GFR (a good thing) but experienced higher rates of heart failure, nonfatal stroke, and hospitalization for heart failure, and a higher death rate from cardiovascular causes.

Off-label use of abatacept

Abatacept (sold under the trade name Orencia) is a protein that inhibits a molecule called B7-1 that activates T cells. It is approved for the treatment of rheumatoid arthritis. It is also in clinical trials for the treatment of multiple sclerosis, type 1 diabetes, and lupus. These are all autoimmune diseases.

Chih-Chuan Yu of Harvard Medical School and coauthors noted elevated levels of B7-1 in certain patients with proteinuric kidney disease, including primary focal segmental glomerulosclerosis (FSGS). They conducted a series of in vitro studies (laboratory experiments) showing that abatacept would block the migration of podocytes (a type of kidney cell). They then recruited patients whose FSGS did not respond to standard treatments: four kidney transplant patients with rituximab-resistant recurrent FSGS and one patient with glucocorticoid-resistant primary FSGS. They treated all five with abatacept, and all five patients experienced remission.

APOL1 risk variants

APOL1 is the gene that encodes the apolipoprotein L1, a component of HDL, also called good cholesterol. Although the exact purpose of APOL1 is not known, we do know that certain variants of APOL1, called G1 and G2, circulating in plasma can suppress Trypanosoma brucei, the parasite that causes sleeping sickness. We also know that these variants are associated with ESRD, though the mechanism isn’t known.

We know that the G1 and G2 variants are more common among African-Americans than among whites/Caucasians. And we know that African-Americans have 3 to 5 times the risk of ESRD compared with whites/Caucasians, even though the prevalence of earlier stages of kidney disease is roughly equal for both racial groups. Thus, the question is whether these variants of APOL1 are responsible for some of the difference in rates of ESRD between blacks and whites.

A paper by Afshin Parsa, et al., attempts to answer that question by looking at data from two studies, one called the African-American Study of Kidney Disease and Hypertension (AASK) and the other called Chronic Renal Insufficiency Cohort (CRIC). They find direct evidence that “APOL1 high-risk variants are associated with increased disease progression over the long-term.”

Data for the AASK patient group are shown in the table below. Some items to notice:

  1. There is very little difference in CKD incidence between the patients with no copies of the APOL1 risk variants and those with one copy. This indicates that the trait is recessive
  2. Even patients with no copies of the risk variants have high rates of CKD. This indicates there are more factors left to be discovered
  3. There is a high prevalence (23%) of patients with 2 copies of the risk variants within the African-American population
  4. The risk variants may explain only about 5% (= 23% * (58% – 35%)) of the difference in the incidence rate of ESRD between blacks and whites. Further, the association does not explain the cause of kidney disease in patients with two copies of the risk variants. It does, however, seem to rule out hypertension and diabetes, since the study controlled for these factors
                                  All patients   Col %   CKD at end*   Col %   Row %
No copies of APOL1 risk variants       234        34%         83        29%     35%
1 copy of APOL1 risk variants          299        43%        112        39%     37%
2 copies of APOL1 risk variants        160        23%         93        35%     58%
TOTAL                                  693       100%        288       100%     42%

*Number with ESRD or doubling of serum creatinine by end of study

Conclusions

All four papers described above were the subject of editorials in this week’s issue of New Engl J Med. One, written by Dr. de Zeeuw, the lead author of the bardoxolone methyl paper, points out that the failure of the ACE inhibitor and ARB therapy may indicate that “improvement in surrogate markers — lower blood pressure or less albuminuria — does not translate into risk reduction.” In fact, he writes that it may go further, and that the use of these two measures as risk markers or “as therapeutic targets in our patients with type 2 diabetes” may be in doubt. He also promotes the use of “enrichment design” to select patients who are less likely to experience an adverse event.

Another editorial, by Jonathan Himmelfarb and Katherine Tuttle of the Univ. Washington School of Medicine (and the Kidney Research Institute), makes three recommendations to improve the safety and likelihood of success of clinical trials. First, all researchers should make more preclinical data available so that others can conduct better preclinical analysis. Second, researchers should consider the possible off-target effects of a proposed agent and collect data on them before starting clinical trials. The development of organ-on-a-chip technology may greatly help with this. Finally, researchers should exercise caution whenever a drug has known side effects, for instance when a “drug for diabetic kidney disease increases, rather than decreases, the degree of albuminuria.”

In a third editorial, Börje Haraldsson of the Univ. Gothenburg says the work of Dr. Yu and his colleagues “may signal the start of a new era in the treatment of patients with proteinuric kidney disease.” Let us hope that is true. As we discover more about how the immune system works, how it interacts with its cellular and microbial environment, and how it can be modulated, treatment of many chronic conditions, cancer, and even old age may be affected.

by George Taniwaki

In a Dec 2011 blog post, I critiqued an article in The Fiscal Times that compared the cost of eating a meal at home against dining out at a restaurant. The article purported to show that eating at a restaurant was cheaper. I pointed out the errors in the analysis.

One of the errors was in the way the data on expenditures at grocery stores and restaurants were shown in a line graph. The two lines were at different scales and aligned to different baselines, making comparisons difficult. The original and corrected charts are shown below. Correcting the baseline makes it clear that restaurant expenditures are significantly lower than grocery expenditures. Correcting the scale shows that restaurant expenditures are not significantly more volatile than grocery expenditures.

FoodShareAllFoodShareCorrected

Figures 1a and 1b. Original chart (left) and corrected version (right)

Another error I pointed out was the article’s implication that the lower inflation rate for restaurant meals compared to meals at home should favor eating more meals at restaurants. It should not. I didn’t give an explanation why at the time, so I will do so here.

Consider an office worker who needs to decide today whether to make a sandwich for lunch or to buy a hamburger at a restaurant. Let’s say she knows that the price of bread and lunch meat has doubled over the past year (100% inflation rate) while the cost of a hamburger has not changed (0% inflation rate). Which should she buy?

The answer is, she doesn’t have enough information to decide. The inflation rate over the past year is irrelevant to her decision today, or at least it should be. What matters are the actual costs and utilities today.

Let’s say she likes shopping, making sandwiches, and cleaning up, so the opportunity cost for the sandwich option is zero. Let’s also say she likes sandwiches and hamburgers equally and values them equally and doesn’t value variety. Now, if the price today for lunch meat and bread for a single sandwich is 50 cents while a hamburger is 75 cents, then she should make a sandwich. Next year if inflation continues as before, making a sandwich will cost $1.00 while a hamburger remains 75 cents. In that case, she should buy a hamburger. But that decision is in the future.

Let’s consider an extreme case where inflation rates may affect purchase decisions today. What if the price of sandwich fixings is 50 cents today, but inflation is expected to be 100% during the work week (so the price will be $0.57, $0.66, $0.76, and $0.87 over the next four days)? Such high inflation rates are called hyperinflation and can lead to severe economic distortions.
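As a quick check, those daily prices follow from assuming a constant daily factor that doubles prices over the five-day work week:

```r
# Sandwich-fixings price each day under 100% inflation over a 5-day work week
start_price  <- 0.50
daily_factor <- 2^(1/5)                     # constant factor that doubles prices in 5 days
round(start_price * daily_factor^(1:4), 2)  # 0.57 0.66 0.76 0.87 -- Tuesday through Friday
```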

Let’s also assume hamburgers are 75 cents today and will remain fixed at that price by law. (Arbitrary but stringent price controls are another common feature of an economy experiencing hyperinflation.) Further, let’s assume that sandwich fixings can be stored in the refrigerator for a week for future use but hamburgers cannot be bought and stored for future consumption.

Finally, let’s assume it is early Monday and our office worker has no sandwich fixings or hamburgers but has $5 available for lunches for the upcoming week. What should she buy each day?

I would recommend trying to buy $2.50 in sandwich fixings today (enough for five sandwiches at today’s price). Here’s why. During a period of hyperinflation, you want to get rid of money as fast as possible because cash loses its value every day you hold it. Thus, buying as much food as possible today is a good investment (called a price hedge).

Ah, you say. Why not make sandwiches the first few days of the week and then switch to the relatively cheaper hamburgers once they become the better deal? That is unlikely to work because the restaurant is caught between paying rising prices for the food it buys and collecting a fixed price for what it sells. Long lines will form as customers seek cheap food. The restaurant will either run out of food, go bankrupt, or close its doors. Regardless, our office worker shouldn’t rely on her ability to buy cheap hamburgers later in the week.

****
So why am I updating a blog post from almost two years ago? Well, it’s because I noticed a big spike in traffic landing on it last week. It turns out my wife, Susan Wolcott, assigned it as a reading for a class she is teaching to undergraduate business students at the Aalto University School of Business in Mikkeli, Finland. (The school was formerly known as the Helsinki School of Economics.)

Normally, this blog receives about 30 page views a day. On days that I post an entry on kidney disease or organ donation (pet topics of mine), traffic goes up. Of a typical day’s 30 hits, I presume about half of the traffic is not human; it comes from web crawlers looking for sites to send spam to (I receive about 15 spam comments a day on this blog).

But check out the big spike in page views on the day my wife assigned the reading. This blog received 264 page views from 91 unique visitors. That’s the kind of traffic social media experts die for. Maybe I’ve hit upon an idea for generating lots of traffic to a website: convince college professors to assign it as required reading for a class.

WebTraffic

Figure 2. Web traffic statistics for this blog

Naturally, I expect another big spike of traffic again today when my wife tells her students about this new blog post.

by George Taniwaki

I recently came across a fascinating article with the novel claim that life is older than the earth. The argument is based on a regression analysis that estimates the rate of increase in the maximum genomic complexity of living organisms over time. (Note, this argument only makes sense if you believe that the universe can be more than 5,000 years old, that genes mutate randomly and at a constant rate, that genetic changes are inherited and accumulate, and that mathematics can be used to explain how things work. Each of those assumptions can be argued separately, but that is beyond the scope of this blog post.)

The article, entitled Life Before Earth, was posted on arxiv.org in Mar 2013. The authors are Alexei Sharov, a staff scientist at the National Institute on Aging, and Richard Gordon, a theoretical biologist at the Gulf Specimen Marine Laboratory.

They draw a log-linear plot of the number of base pairs in a selection of terrestrial organisms against the time when each organism first appeared on earth (see Fig 1). For instance, the simplest bacteria, called prokaryotes, have between 500,000 and 6 million base pairs in their genomes and are believed to have first appeared on earth about 3.5 billion years ago. At the other extreme, mammals, including humans, have between about 2.7 billion and 3.2 billion base pairs in their genomes. The fossil record indicates the first mammals appeared about 225 million years ago, during the Triassic period. All other known organisms can be plotted on these axes, and the trend appears linear, meaning the growth in genome complexity is nearly exponential.

Extrapolating the data back in time, one can estimate when the maximum complexity was only one base pair, that is, the date when the first protoörganisms formed. The trend line indicates this occurred 9.7 billion years ago, or about 4 billion years after the big bang.

Arxiv

Figure 1. The growth in maximum base pair count per genome seems to grow exponentially over time. Image from arxiv.org
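Here is a toy R version of the extrapolation, using only the two anchor points quoted above (prokaryotes at roughly 10^6 base pairs appearing about 3.5 billion years ago, and mammals at roughly 3 × 10^9 base pairs appearing about 225 million years ago). The paper fits many more taxa, so this crude two-point fit lands near, but not exactly at, its 9.7 billion year estimate.

```r
# Two crude anchor points taken from the text above, not the paper's full data set
time_gya   <- c(3.5, 0.225)  # first appearance, in billions of years ago
base_pairs <- c(1e6, 3e9)    # rough genome sizes for prokaryotes and mammals

fit <- lm(log10(base_pairs) ~ time_gya)

# Extrapolate back to a "genome" of one base pair, i.e. solve log10(bp) = 0 for time
origin_gya <- -coef(fit)["(Intercept)"] / coef(fit)["time_gya"]
origin_gya  # roughly 9 billion years ago with these two anchors
```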

The earth is estimated to be only 4.5 billion years old. Thus, if these results are accepted, the implications are pretty astounding.

1. Life did not start on earth. It started somewhere else in the galaxy and reached the microörganism stage. Through some cosmic explosion, the life was transported here. Alternatively, life started on one of the planets around the star that went supernova before collapsing and forming our present-day sun. This hypothesis is called exogenesis

2. It is unlikely that these alien microörganisms only landed on earth as our solar system formed. They probably coated every asteroid, comet, planet, and moon in our solar system. They may still be alive in many locations and are either evolving or dormant

3. If all of the microörganisms that reached our solar system came from the same source, they likely have the same genetic structure. That is, if we find life elsewhere in our solar system, it is likely to contain a right-handed double helix of DNA built from the same four nucleotide bases (and proteins built from the same left-handed amino acids) as life on earth. With effort, we could construct an evolutionary tree that contains these organisms

4. In fact, the same microörganisms may be very common throughout the galaxy, meaning life has arrived on many other planets, or perhaps every planet in our galaxy, even ones with no star, a hypothesis called panspermia

5. It solves Fermi’s paradox. In 1950, Enrico Fermi noted that our Sun is a young star and that there are billions of stars in our galaxy billions of years older. Why haven’t we already been contacted by intelligent life from other planets? The answer, based on this analysis, is that intelligent life takes about 9.7 billion years to evolve, and we may be among the first organisms to reach the level of intelligence necessary to achieve intentional interstellar communication and travel. Ironically, if Sharov and Gordon are right, unintentional interstellar travel is already quite common and has been for billions of years.

Relationship to Moore’s Law

If the exponential growth of complexity shown in Figure 1 above looks familiar, it is because it is the same  shape as the increase in the number of transistors in microprocessor chips over time, a relationship called Moore’s Law. The authors cite this analogy as a possible argument in favor of their case.

MooresLaw

Figure 2. The growth in maximum transistor count per processor has grown exponentially over time. Image from Wikipedia

Is this reasonable?

Like I said, this is a fascinating article. But it is all speculation. We have no direct evidence of many of the claims and inferences made in this paper. Specifically, we don’t know:

  1. The exact size of the genome that various organisms had in the earth’s past
  2. The nature of possible organisms less complex than prokaryotes
  3. The existence of any alien microörganisms or evidence that any landed on early earth
  4. The speed of genetic complexity changes in the early earth environment, or on other planetary environments in the case of the alien microörganisms prior to arrival on earth
  5. Whether any modern earth organisms, or any potential alien microörganisms, could withstand the rigors of travel through space for the millions of years it would take to get to earth from another star system

Finally, we have no clear explanation of why the rate of change in genome complexity should be exponential. The use of the Moore’s Law chart to show that exponential growth in complexity is reasonable is slightly disingenuous. Moore’s Law is generally used as a forecast of future growth in complexity for a commercial product based on historical data. Further, the forecast is used to estimate product demand, research costs, and necessary production investment, all of which tends to drive advancements and make the prediction self-fulfilling.

On the other hand, genome complexity is not directed. Evolution is a random process that will generate greater complexity only if a new, more complex organism can take advantage of an ecological niche that cannot be exploited by simpler organisms. Nothing is driving greater genome complexity.

Anyway, this is a very controversial theory. But I believe it may lead to new insights regarding microbiology, astrobiology, the molecular clock hypothesis, and the use of mathematical models in evolutionary biology.

****

How long is 13.7 billion years?

As a side note, sometimes we wonder: how could something as complex as DNA and cellular organisms have started from nothing? It seems impossible to comprehend. But if you take a look at Figure 1, you will see that it may have taken over six billion years for a single base pair of DNA to grow in complexity into an organism that we think of as a simple, primitive prokaryote. Then it took another 3.3 billion years before mammals appeared. Then it took only about 200 million more years before our human ancestors appeared. And finally, only a few hundred thousand years passed before you and I were born.

To give you a feel for how long 13.7 billion years is, watch this extremely boring YouTube video that compresses each billion years into one minute increments.

AgeOfUniverse

Figure 3. Age of Universe, boring, even with music. Video still from Craig Hall

****

A final thought to close this blog post: there may be aliens living on earth, but don’t be afraid, because it’s us.

by George Taniwaki

A graphic in Bloomberg Businessweek Mar 2013 (reproduced below) lists the four metro areas with the greatest economic growth over the five-year period 2007-2011. It also gives their population change during the same period. And it lists the four cities that had both negative population growth and negative GDP growth during the same period.

Business Week

Figure 1. Ranking 8 cities by total growth. Image from Bloomberg Businessweek

This chart is a bit light on data, containing only 16 data points. And the changes in population and GDP are not directly comparable, since the population change is reported cumulatively over the four years (the total number of years minus one) while the GDP change is annualized. Let’s calculate the cumulative GDP change as follows:

Total change GDP = (1 + Annual change GDP)^Years – 1.

Also, notice that the data has differing numbers of significant digits. The annualized GDP changes are displayed with two digits. The population changes show one, except for Chicago and Providence which have several. I’m sure this was done to show that the populations of these two cities were falling rather than flat. Let’s get rid of those extra digits.

This chart ranks the best and worst performing metro areas. One could reasonably argue that the metro areas with the greatest absolute GDP growth are the best. (I will argue otherwise shortly.) But should the worst performing areas be defined as the four that had both declining population and declining GDP? For a counterexample, consider a city where the population is growing but GDP is falling. I would say it is actually in worse shape, based on the negative value of its per capita GDP growth. In fact, any city where the population is growing faster than GDP (or shrinking more slowly than GDP) would have negative per capita GDP growth. Perhaps the change in GDP per capita is a better measure of performance than the change in total GDP.

To address this, let’s calculate the change in GDP per capita as follows:

Change in GDP per capita = ((Change in GDP + 1) / (Change in pop. + 1)) – 1.
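Here are the two conversions as a short R sketch, with hypothetical input values chosen only to show the arithmetic:

```r
# Hypothetical inputs, just to illustrate the two conversions
years             <- 4
annual_gdp_change <- 0.05   # 5% per year, annualized
total_pop_change  <- 0.04   # 4% cumulative over the same four-year span

# Convert the annualized GDP change to a cumulative change
total_gdp_change <- (1 + annual_gdp_change)^years - 1   # about 21.6%

# Normalize by population growth to get the change in GDP per capita
gdp_per_capita_change <-
  (1 + total_gdp_change) / (1 + total_pop_change) - 1   # about 16.9%
```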

The normalized data from the chart is summarized in the table below.


Metro area     Total change in pop   Total change in GDP   Total change in GDP per capita
Portland                4%                    22%                        18%
San Jose                3%                    19%                        15%
Austin                 12%                    14%                         2%
New Orleans            16%                     8%                        -7%
Detroit                -4%                   -19%                       -16%
Cleveland              -1%                    -5%                        -4%
Chicago                -0%                    -2%                        -2%
Providence             -0%                    -1%                        -1%

Notice now that New Orleans, despite having very high GDP growth, has large negative per capita GDP growth because its population is growing faster than its GDP. Austin’s performance now looks less impressive too. And while Cleveland, Chicago, and Providence all have negative per capita GDP growth, they are not doing as badly as it first appears.

Even when normalized, the data in the table is still lacking context. It doesn’t give the reader a feel for the big picture. For instance, how many metropolitan areas with a population over 1 million are there in the U.S.? What is the average population change and GDP change among those cities? Which cities had the greatest change, either positive or negative, in population, GDP, and per capita GDP?

Continuing that analysis, we would want to know if most cities were growing near the average rate or if there is a large dispersion. What is the shape of this dispersion? Are there geographic location, city size, or other factors that correlate with growth? Finally, are there time series trends? To answer these questions we need to go back to the source data and create our own charts.

Creating the metro population and GDP dataset

The footnote to the Bloomberg Businessweek chart says the data is from the Bureau of Economic Analysis and the Census Bureau. The BEA GDP data is available from an interactive website. I selected Table = GDP by metro area, Industry = All industry total, Area = All MSAs, Measures = Levels, and Year = 2007 to 2011.

The Census Bureau population estimates for the metropolitan statistical areas (MSAs) are available for download from the Census website. I downloaded the historical estimates for 2000 to 2009 (based on the 2000 census) and the current estimates covering 2010 to 2012 (based on the 2010 census). I merged the three data sets keyed off the core-based statistical area (CBSA) code.

Note that several of the CBSAs changed in 2010, meaning the code changed too. The most significant is that Los Angeles-Long Beach-Santa Ana, CA (31100) changed to Los Angeles-Long Beach-Anaheim, CA (31080).

In addition to the MSA records, I created two additional records. One contains the total population and GDP for all MSAs and the other for MSAs with population greater than one million.

Since the geographic names of the MSAs are often quite long, I want to find shorter labels that I can use on a scatterplot. I decide to use airport codes. These are short, unique, cover any big city with an airport worldwide, and if you travel a lot, you’ve possibly memorized quite a few, so you don’t need a legend to decode them. I append this to each record.

Finally, I calculate the following descriptive statistics for each MSA and append them to the records:

Change in population = (Pop on Jul 2011 / Pop on Jul 2007) – 1

Change in GDP = (GDP for 2011 / GDP for 2007) – 1

Per capita GDP for year 20xx = GDP for year 20xx / Pop on Jul of year 20xx

Change in GDP per capita = (Per capita GDP for 2011 / GDP per capita for 2007) – 1
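I did these calculations in Excel, but the equivalent derived columns in R would look something like the sketch below. The data frame name msa and the column names pop_2007, pop_2011, gdp_2007, and gdp_2011 are hypothetical placeholders, not the actual names in my spreadsheet.

```r
# Assumes a merged data frame named msa with hypothetical columns
# pop_2007, pop_2011, gdp_2007, and gdp_2011
msa$change_pop <- msa$pop_2011 / msa$pop_2007 - 1
msa$change_gdp <- msa$gdp_2011 / msa$gdp_2007 - 1

msa$gdp_per_capita_2007 <- msa$gdp_2007 / msa$pop_2007
msa$gdp_per_capita_2011 <- msa$gdp_2011 / msa$pop_2011

msa$change_gdp_per_capita <-
  msa$gdp_per_capita_2011 / msa$gdp_per_capita_2007 - 1
```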

An Excel spreadsheet containing the original data, the merged table, and the scatterplot is available on SkyDrive.

Interactive data available on SkyDrive

Comparing the data I collected with the Bloomberg Businessweek data, the ranking for the top four cities match, but the values for population change and GDP change do not. This could be because different data was used (historical population and GDP estimates are revised annually).

The data for the bottom four cities don’t match at all. The data I collected show only one city, Detroit, that had both falling population and falling GDP during the time period. The other three cities showed rising GDP, and two showed rising population as well. And despite being singled out by Bloomberg Businessweek for falling population and GDP, all four cities showed rising GDP per capita.

The data for those 8 metro areas, plus a few outliers, are shown below. The means for all 51 MSAs with populations greater than one million are included for comparison.


Metro area       Total change in pop   Total change in GDP   Total change in GDP per capita
Portland                  4.5%                 23.1%                      17.8%
San Jose                  5.0%                 28.6%                      12.9%
Austin                   11.7%                 18.4%                       6.0%
New Orleans               9.4%                 17.5%                       7.4%
Salt Lake City            1.3%                 14.9%                      13.4%
Mean                      4.2%                  6.9%                       2.6%
Detroit                  -3.8%                 -2.4%                       1.4%
Cleveland                -1.5%                  3.5%                       5.0%
Chicago                   0.5%                  5.4%                       4.9%
Providence                0.0%                  6.7%                       6.6%
Las Vegas                 7.0%                 -5.9%                     -12.1%
Charlotte                36.7%                  9.3%                     -20.1%

Visualization and Analysis

I generated a simple scatterplot of change in GDP against change in population for all 51 MSAs. The cities from the table above are highlighted in green and red. I added a population-weighted trend line, shown in brown. The trend line passes through the mean (4.2%, 6.9%) and has a y-intercept at 2.6%.
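I built the chart in Excel, but a rough R equivalent would look something like the sketch below. It assumes the hypothetical msa data frame from the earlier sketch, plus change_pop, change_gdp, airport_code, and pop_2011 columns.

```r
# Scatterplot of GDP change against population change, labeled with airport codes
plot(msa$change_pop, msa$change_gdp,
     xlab = "Total change in population, 2007-2011",
     ylab = "Total change in GDP, 2007-2011")
text(msa$change_pop, msa$change_gdp, labels = msa$airport_code, pos = 3, cex = 0.7)

# Population-weighted linear trend, analogous to the brown trend line in Figure 2
trend <- lm(change_gdp ~ change_pop, data = msa, weights = pop_2011)
abline(trend, col = "brown")
```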

I could have made the chart fancy by adding information using the size, shape and color of the markers. For instance I could change the size of the markers based on the population of the MSA, change the shape of the marker based on whether the city was coastal or inland, and change the color of the marker based on which Census region it was in.

GdpPop

Figure 2. A simple scatterplot showing the top 51 MSAs. Image by George Taniwaki

The four metro areas with the highest GDP growth are all above the trend line and have high per capita GDP growth. However, the Bloomberg Businessweek chart leaves off Salt Lake City, which has lower GDP growth but, because its population grew only 1.3% during the period, a very high per capita GDP growth of 13.4%.

The fastest shrinking metro area is Detroit, which matches the Bloomberg Businessweek result. Note that it lies above the 45-degree diagonal running through the origin, meaning its GDP decline is smaller than its population decline, so it has positive GDP per capita growth. However, it is still below the trend line, meaning it is growing more slowly than the average.

The other three metro areas in the Bloomberg Businessweek chart, Cleveland, Chicago, and Providence, all show slow or negative population growth, but are all above the trend line. They probably should not be considered bust-towns. The true bust-town in the scatterplot is Las Vegas. It is an outlier with a population growth of 7% but a GDP decline of 6% which results in a 12% drop in GDP per capita.

The final outlier is Charlotte. It shows a population gain of nearly 37%, more than double that of the next fastest growing city. But it has only a 9% increase in GDP, leaving it with a 20% drop in GDP per capita. This is a sign that rapid growth can actually be very bad for a city.

Data error and bias

The statement in the paragraph above assumes that the data for change in economic activity and in population at the MSA level developed by two separate organizations for separate reasons is accurate and comparable. Neither of these assumptions is particularly sound. Specifically, there is a big discontinuity in the population estimate for Charlotte between Jul 2009 (the last estimate based on the 2000 census) and Jul 2010 (the first estimate based on the 2010 census) that accounts for most of the population gain. Thus, the annual population estimates may need to be smoothed before calculating the change between years.

I believe the BEA estimate of economic activity for an MSA is based partly on the population estimate for the MSA. Thus, if the population estimate changes (it is revised annually), then the GDP estimate will no longer be valid and will need to be updated.

Finally, you should be careful when combining data from different sources and comparing them. We do it all the time, but we have to be conscious of what the consequences are. This is an especially important point since everybody today is rapidly building giant data warehouses and running analytics on data that has never been combined before.