Google rewards reputable reporting, not left-wing politics, from The Economist

by George Taniwaki

A few months ago The Economist added a new feature to its back section called Graphic detail. It’s a pleasure to read because it nearly always contains bivariate plots where the x-axis is something more interesting than the date.

This week’s entry does not disappoint. It is entitled “Seek and you shall find” and contains two charts (see above) with interesting x-axes. The charts analyze the impact of Google news search on the traffic a news source receives, using two independent measures, an accuracy score and an ideology score, to rate different news sources. Accuracy and bias were determined using data from outside sources cited in the article.

Many people claim that Google favors liberal news sources to the detriment of conservative views. Google claims it has a set of outside reviewers who check news sources for accuracy and reach; point of view is not considered. However, one could imagine that a news source with a strong point of view may report facts selectively to match that point of view, which would reduce its accuracy. As can be seen in the chart on the left, news sources with a strong ideological bias (darker red and blue dots) tend to have lower accuracy scores than less biased sources. I encourage you to go to the website because the data is interactive.

The dependent y-axis is the share of web traffic that comes from search engines. This is a bit problematic, since users who believe that Google’s results are biased against their favorite news sources will visit those sources directly without using a search engine. Nonetheless, the data shows that search engine (mostly Google) share of web traffic increases with accuracy, not with ideology. That is, the plot on the left shows a linear relationship while the right plot does not.

Expected v. Actual

A separate experiment confirms the results. The Economist built a model to predict the number of search results 37 publications should receive from Google’s search engine based on their accuracy and their reach. It then compared the model results to actual search results on a “clean” computer using “a browser with no history, in a politically centrist part of Kansas.” (Why Kansas, you wonder? I’m guessing that is where the author lives.)


No bias detected, from The Economist

Again, no bias was detected. The differences between left and right are small and could be due to how they are defined, the time of the study, the keywords searched, or other factors. The story is an excellent example of combining data from multiple sources, programming a bot to collect data, and the visual display of statistical analysis.


by George Taniwaki

I have now completed all ten classes required to receive the Microsoft Data Science Certificate. As a newly minted data scientist, I am ready to dig into large datasets and make incredible (and if I am not careful, potentially unverifiable) predictions. A description of the four I took during the past two quarters is shown below. (For a list of the first six classes I took, see this Jul 2017 blog post.)

DAT209x – Programming with R for Data Science

This class is a continuation of DAT204x. Topics covered include functions and data structures, loops and flow control, working with vectors and matrices, reading and writing data files, reading SQL databases, manipulating data (i.e., merging, subsetting, filtering, introduction to grep and text functions, using date and time functions, aggregating, grouping, and summarizing), simulation, linear models, and graphics.

The final exam for this course used the same DataCamp-based timed format as in DAT204x. I didn’t do well on that test, but felt confident going in this time because I was prepared for it. Unfortunately, I ended up failing the test because I could not answer the first question and then could not navigate past it. After spending over half an hour trying to resolve the issue and contacting tech support (which promised a reply within 24 hours but never responded), I gave up. How annoying. Luckily, it didn’t matter since I already had enough points to pass the class. If I were susceptible to test anxiety, this would have been a traumatic experience.

Time: 12 hours for 12 modules

Score: 100% on all exercises (not graded) and labs. Missed 1 quiz question because my instance of the ggplot2 library behaved differently than the one used in class. Got zero on the final exam (see above) for a combined score of 79%

DAT209x Score   DAT209x Certificate

DAT203.2x – Principles of Machine Learning

This class is a continuation of DAT203.1x. It covers the theory and application (using Microsoft Azure Machine Learning) of popular classification models including logistic regression, boosted decision trees, neural networks, and support vector machines (SVM). The models are tuned (that is, optimized for accuracy on data outside the training set) using permutation, regularization, and cross-validation, which are all techniques for controlling overfitting.
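Of those tuning techniques, cross-validation is the easiest to show in a few lines. Below is a minimal k-fold sketch in plain Python; the course itself does all of this through MAML’s drag-and-drop modules, and the toy data and mean-only “model” here are placeholders of my own.

```python
# A minimal k-fold cross-validation sketch in plain Python.

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k roughly equal folds."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(xs, ys, k=5):
    """Estimate out-of-sample MSE of a mean-only model via k-fold CV."""
    errors = []
    for fold in k_fold_indices(len(xs), k):
        train_y = [y for i, y in enumerate(ys) if i not in fold]
        pred = sum(train_y) / len(train_y)  # "model": predict the training mean
        errors += [(ys[i] - pred) ** 2 for i in fold]
    return sum(errors) / len(errors)

data_x = list(range(10))
data_y = [2.0 * x for x in data_x]
print(cross_validate(data_x, data_y, k=5))  # prints 51.0 on this toy data
```

Every observation gets scored exactly once by a model that never saw it during training, which is what makes the error estimate honest.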

The class also covers two popular models for continuous outcomes, linear regression and random forests. Beyond prediction models, the class covers K-means clustering and the Matchbox recommender system developed by Microsoft Research.

Similar to my experience with DAT203.1x, I am really impressed with the power of MAML but a bit disappointed in the cookbook nature of the labs in the class.

The final exam was an assignment to create a prediction model for the number of minutes difference between the actual arrival time and the scheduled arrival time for commercial airline flights. The grade was based on quality of the predictions for 25 flights outside the training set. The final was graded with one point for each prediction that was within 10 minutes of the actual arrival time.

My model had a mean absolute error of about 8.5 (a bit high since the goal is to get within 10 minutes), but I got 22 out of 25 predictions within the allowed range (88%). I am guessing that with more engineering effort I could reduce my error size. For instance, I could have created a categorical variable for gate load that segmented flight arrivals into weekday morning (a busy time), weekday midday, weekday evening (also busy), weekend day, any night, and heavy traffic days before and after major holidays.

But I don’t think those improvements would help with the 3 cases that my model got wrong. They appear to be outliers that could not be predicted using the variables at hand. A better prediction would need to include variables that were not available in the dataset like weather at the arrival airport, airport construction status, and if any landing restrictions were in effect at the airport that day.
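The distinction between the two metrics is worth making concrete. The residuals below are made up to mimic the situation described (moderate errors plus a few large outliers), not the actual flight data from the assignment.

```python
# MAE and "within 10 minutes" accuracy are different yardsticks:
# a few large outliers inflate MAE without costing many points.

errors = [3, -5, 2, 8, -9, 4, 1, -2, 35, -40, 28]  # predicted minus actual, minutes

mae = sum(abs(e) for e in errors) / len(errors)
within_10 = sum(1 for e in errors if abs(e) <= 10) / len(errors)

print(round(mae, 1))        # 12.5 -- dominated by the three outliers
print(round(within_10, 2))  # 0.73 -- the fraction scored as "correct"
```

Shrinking the typical error helps MAE, but only fixing the outliers would raise the within-10 score, which is why the grading scheme and the loss function pull in different directions.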

Time: 12 hours for 6 modules

Score: 100% on the 6 labs, 88% accuracy in final model, and 100% on 2 surveys, for a combined score of 95%

DAT203.2x Score     DAT203.2x Certificate

DAT203.3x – Applied Machine Learning

This class is a continuation of DAT203.2x. It consists of four distinct modules. The first introduces time series analysis with emphasis on seasonal-trend decomposition using LOESS (STL) and transforming the data into a stationary process using autoregressive integrated moving average (ARIMA) models.
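To see the decomposition idea concretely, here is a toy additive decomposition in Python: trend from a centered moving average, seasonal component from per-period means. STL proper uses iterated LOESS smoothing; this sketch, on a synthetic series of my own, only shows the concept.

```python
# Toy additive decomposition: series = trend + seasonal.
period = 4
series = [10 + 0.5 * t + [3, -1, -4, 2][t % period] for t in range(24)]

# Trend: centered moving average over one full period
# (endpoints weighted by half because the period is even).
trend = []
for t in range(len(series)):
    if t < period // 2 or t >= len(series) - period // 2:
        trend.append(None)  # not enough neighbors at the edges
    else:
        w = series[t - period // 2 : t + period // 2 + 1]
        trend.append((w[0] / 2 + sum(w[1:-1]) + w[-1] / 2) / period)

# Seasonal: average detrended value at each position in the cycle.
detrended = [s - tr for s, tr in zip(series, trend) if tr is not None]
positions = [t % period for t, tr in enumerate(trend) if tr is not None]
seasonal = [round(sum(d for d, p in zip(detrended, positions) if p == i) /
                  positions.count(i), 2) for i in range(period)]
print(seasonal)  # recovers [3.0, -1.0, -4.0, 2.0], the seasonal pattern
```

Subtracting both components leaves the residual, which is what ARIMA then models as a stationary process.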

Next, the class covers spatial data analysis with interpolation using kernel density estimation and K-nearest neighbors. The data is modeled using spatial Poisson processes, variograms, and kriging. The resulting output can be displayed using dot maps, bubble maps, heat maps, and choropleth maps. The analysis is done using R in a Jupyter notebook in MAML, as a stored procedure in SQL Server 2016 R Services, and as an R script in Power BI.
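Kriging requires fitting a variogram, but the simpler end of the spatial-interpolation spectrum fits in a few lines. This is a k-nearest-neighbor inverse-distance-weighting sketch on made-up sample sites, not a dataset from the course.

```python
# Inverse-distance-weighted interpolation from the k nearest sample sites.
import math

# (x, y, measured value) -- made-up sample sites
sites = [(0, 0, 10.0), (1, 0, 12.0), (0, 1, 14.0), (2, 2, 20.0)]

def idw(px, py, k=3, power=2):
    """Interpolate at (px, py) from the k nearest sites, weights ~ 1/d^power."""
    nearest = sorted((math.hypot(px - x, py - y), v) for x, y, v in sites)[:k]
    if nearest[0][0] == 0:           # exactly on a sample site
        return nearest[0][1]
    weights = [1 / d ** power for d, _ in nearest]
    return sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)

print(round(idw(0.5, 0.5), 2))  # 12.0: equidistant from the 10, 12, and 14 sites
```

Unlike kriging, this carries no model of spatial correlation, so it gives no error estimate, which is the main reason the course moves on to variograms.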

The third module covers text analysis. English text is processed using text normalization, stop word removal, and stemming. (These methods may not be applicable in other languages or scripts.) Once the text is clean, it can be analyzed for word frequency, word importance, named entity recognition, text mining, and sentiment analysis. All of these techniques are used in natural language processing.
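A bare-bones version of that normalization pipeline looks something like this. The stop-word list and the crude suffix-stripping “stemmer” are toy stand-ins of my own (a real pipeline would use something like the Porter stemmer).

```python
# Toy English normalization: lowercase, tokenize, drop stop words, crude stemming.
import re

STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}

def normalize(text):
    tokens = re.findall(r"[a-z']+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    stemmed = []
    for t in tokens:
        for suffix in ("ing", "ed", "es", "s"):
            # strip one common suffix, but keep a minimal stem length
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed

print(normalize("The analysts are analyzing the trending topics"))
# -> ['analyst', 'analyz', 'trend', 'topic']
```

Note how the crude stemmer maps “analysts” and “analyzing” to different stems; that kind of error is exactly why production systems use a proper stemmer or lemmatizer.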

The final module introduces image processing and image analysis using Python routines in matplotlib. Example techniques include denoising using convolution with a Gaussian blur or median filter and prewhitening using Gaussian noise. Python code is used to show how to resize and rotate images. Feature extraction is demonstrated using Sobel edge detection, segmentation, and Harris corner detection. The basic image morphology operators are introduced, including dilation, erosion, opening, and closing. The course also introduces the cognitive services APIs available in the Azure portal and how to access them using Python and C#.
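Convolution is the workhorse behind several of those filters. Here is a minimal Python version that slides a 3×3 Sobel kernel over a tiny grayscale image (nested lists standing in for a real image array); strictly speaking it computes cross-correlation, which is the usual convention in image libraries.

```python
# Apply a 3x3 kernel to a tiny grayscale image, ignoring the 1-pixel border.

def convolve(img, kernel):
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = sum(img[y + dy][x + dx] * kernel[dy + 1][dx + 1]
                            for dy in (-1, 0, 1) for dx in (-1, 0, 1))
    return out

# A vertical edge: dark left half, bright right half
image = [[0, 0, 10, 10]] * 4

sobel_x = [[-1, 0, 1],   # responds to left-to-right intensity change
           [-2, 0, 2],
           [-1, 0, 1]]

edges = convolve(image, sobel_x)
print(edges[1])  # strong response at the two boundary columns
```

The same loop with a kernel of all 1/9s is a box blur; swapping in Gaussian weights gives the Gaussian blur mentioned above.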

Time: 12 hours for 4 modules

Score: 100% on the 4 labs and missed one question on the final exam for a combined score of 95%

DAT203.3x Score     DAT203.3x Certificate

DAT102x – Data Science Professional Capstone

This is the tenth and final course needed to receive the Microsoft Data Science Certificate. The class lasts one month and consists of a machine learning contest and a report. The project changes every quarter. For a description of the April contest, see the three-part blog entry that starts in June 2018.

Time: 10 hours for creating a machine learning model and writing the report

Score: Missed one question in the data analysis section (because the mean and standard deviation had to be reported to six decimal places and I only entered them to 2 decimals), scored 86% on my machine learning model, and 100% on my report graded by 3 class peers, for a final score of 91%

DAT102x Score     DAT102x Certificate

Final Certificate

And finally, below is the Data Science Certificate.



by George Taniwaki

Big data and machine learning are all the rage now. Articles in the popular press inform us that anyone who can master the skills needed to turn giant piles of previously unexplored data into golden nuggets of business insight can write their own ticket to a fun and remunerative career (efinancialcareers May 2017).

Conversely, the press also tells us that if we don’t learn these skills a computer will take our job (USA Today Mar 2014). I will have a lot more to say about changes in employment and income during the industrial revolution in future blog posts.

But how do you learn to become a data scientist? And which software stack should one specialize in? There are many tools to choose from. Since I live in the Seattle area and do a lot of work for Microsoft, I decided to take an online program developed and sponsored by Microsoft and edX. Completion of the program leads to a Microsoft Data Science Certificate.

The program consists of 10 courses with some choices, like conducting analysis using either Excel or Power BI, and programming using either R or Python. Other parts of the Microsoft stack you will learn include SQL Server for queries and Microsoft Azure Machine Learning (MAML) for analysis and visualization of results. The courses are priced at about $99 each. You can audit them for free if you don’t care about the certificates.

I started the program in February and am about half way done. In case any clients or potential employers are interested in my credentials, my progress is shown below.

DAT101x – Data Science Orientation

If you haven’t been in college in a while or have never taken an online class, this is a good introduction to online learning. The homework consists of some simple statistics and visualization problems.

Time: 3 hours for 3 modules

Score: 100% on 3 assignments

DAT101x Score    DAT101x Certificate

DAT201x – Querying with Transact-SQL

I took a T-SQL class online at Bellevue College two years ago. Taking a class with a real teacher, even one you never meet, was a significantly better experience than a self-paced MOOC. This course starts with the basics like select statements, subqueries, and variables. It also covers intermediate topics like programming, expressions, stored procedures, and error handling. I did my homework using both a local instance of SQL Server and an Azure SQL database.

Time: 20 hours for 11 modules

Score: I missed one question in the homework and two in the final exam for a combined score of 94%

DAT201x Score     DAT201x Certificate

DAT207x – Analyzing and Visualizing Data with Power BI

I already have experience creating reports using Power BI. I also use Power Query (now called Get & Transform) with the M language, and Power Pivot with the DAX language, so this was an easy class.

The course covers data transforms, modeling, visualization, Power BI web service, organization packs, security and groups. It also touches on the developer API and building mobile apps.

Time: 12 hours for 9 modules

Score: I missed one lab question for a combined score of 98%

DAT207x Score     DAT207x Certificate

DAT222x – Essential Statistics for Data Analysis using Excel

This class is comprehensive and covers all the standard statistics and probability topics, including descriptive statistics, Bayes rule, random variables, the central limit theorem, sampling and confidence intervals, and hypothesis testing. Most analysis is conducted using the Analysis ToolPak add-in for Excel.
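To give a flavor of the exercises, here is a 95% confidence interval for a mean computed by hand in Python, using the normal approximation. The sample numbers are made up; the course does the equivalent with Excel formulas.

```python
# 95% confidence interval for a mean, normal approximation.
import math

sample = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3, 12.2, 11.7, 12.4, 12.1]
n = len(sample)
mean = sum(sample) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))  # sample std dev
margin = 1.96 * sd / math.sqrt(n)   # z = 1.96 for 95%; a t value would be stricter

print(round(mean, 2), round(mean - margin, 2), round(mean + margin, 2))
```

With only 10 observations, a t critical value (about 2.26 here) would widen the interval; the z approximation is what introductory treatments usually show first.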

Time: I used to work in market research, so I know my statistics. However, there are 36 homework assignments and it took me over 20 hours to complete the 5 modules.

Score: I missed 9 questions on the quizzes (88%) and six in the final exam (81%) for a combined score of 86%. (Despite the time it takes to complete, homework counts very little toward the final grade)

DAT222x Score     DAT222x Certificate

DAT204x – Introduction to R for Data Science

Now we are getting into the meat of the program. R is a functional language. In many ways it is similar to the M language used in Power Query. I was able to quickly learn the syntax and grasp the core concepts.

The course covers vectors, matrices, factors, lists, data frames, and simple graphics.

The lab assignments use DataCamp which has a script window where you write code and a console window that displays results. That makes it easy to debug programs as you write them.

The final exam used an unexpected format. It was timed and consisted of about 50 questions, mostly fill-in-the-blank responses that include code snippets. You are given 4 minutes per question. If you don’t answer within the time limit, it goes to the next question. I completed the test in about 70 minutes, but I ran out of time on several questions, and was exhausted at the end. I’m not convinced that a timed test is the best way to measure subject mastery by a beginning programmer. But maybe that is just rationalization on my part.

Time: 15 hours for 7 modules

Score: I got all the exercises (ungraded) and labs right and missed two questions in the quizzes. I only got 74% on the final, for a combined score of 88%

DAT204x Score     DAT204x Certificate

DAT203.1x Data Science Essentials

The first three modules in this course covered statistics and were mostly a repeat of the material introduced in DAT222x. But the rest of the course provides an excellent introduction to machine learning. You learn how to create a MAML instance, import a SQL query, manipulate it using R or Python, create a model, score it, publish it as a web service, and use the web service to append predictions as a column in Excel. I really like MAML. I will post a review of my experience in a future blog post.

The course was a little too cookbook-like for my taste. It consisted mostly of following directions to drag-drop boxes onto the canvas UI and copy-paste code snippets into the panels. However, if you want a quick introduction to machine learning without having to dig into the details of SQL, R, or Python, this is a great course.

Time: 10 hours for 6 modules

Score: 100% on the 6 labs and the final

DAT203.1x Score     DAT203.1x Certificate

I have now completed six out of the ten courses required for a certificate. I expect to finish the remaining 4 needed for a certificate by the end of the year. I will also probably take some of the other elective courses simply to learn more about Microsoft’s other machine learning and cloud services.

For my results in the remaining classes, see Microsoft Data Science Certificate-Part 2

Update: Modified the description of the final exam for DAT204x.

by George Taniwaki

Did you watch the debate on Monday night? I did. But I am also very interested in the post-debate media coverage and analysis. This morning, two articles that combine big data and the debate caught my eye. Both are novel and much more interesting than the tired stories that simply show changes in polls after a debate.

First, the New York Times reports that during the presidential debate (between 9:00 and 10:30 PM EDT) there was a high correlation between the Betfair prediction market for who will win the presidential election and after-hours S&P 500 futures prices (see chart 1).


Chart 1. Betfair prediction market for Mrs. Clinton compared to S&P 500 futures. Courtesy of New York Times

Correlation between markets is not a new phenomenon. For several decades financial analysts have measured the covariance between commodity prices, especially crude oil, and equity indices. But this is the first time I have seen an article illustrating the covariance between a “fun” market for guessing who will become president and a “real” market. Check out the two graphs above; the similarity in shape is striking, including the fact that both continue to rise for about an hour after the debate ended.

In real time, while the debate was being broadcast, players on Betfair raised their estimate of the chance Mrs. Clinton would win the election by 5 percentage points. Meanwhile, the price of S&P 500 futures rose by 0.6%, meaning investors (who may be the same speculators who play on Betfair) believed stock market prices in November were likely to be higher than they had expected before the debate started. There was no other surprise economic news that evening, so the debate is the most likely explanation for the surge. Pretty cool.

If the two markets are perfectly correlated (they aren’t) and markets are perfectly efficient (they aren’t), then one can estimate the difference in equity futures market value between the two candidates. If a 5% decrease in likelihood of a Trump win translates to a 0.6% increase in equity futures values, then the difference between Mr. Trump or Mrs. Clinton being elected (a 100% change in probability) results in about a 12% or $1.2 trillion (the total market cap of the S&P 500 is about $10 trillion) change in market value. (Note that I assume perfect correlation between the S&P 500 futures market and the actual market for the stocks used to calculate the index.)
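The arithmetic can be checked in a few lines. The 5-point probability move and 0.6% futures move come from the articles; the market-cap and total-asset figures are the same rough round numbers used in the text.

```python
# Back-of-the-envelope version of the calculation above (with all the
# perfect-correlation and perfect-efficiency caveats from the text).

prob_shift = 0.05          # change in P(Clinton win) during the debate
futures_move = 0.006       # simultaneous move in S&P 500 futures
sp500_cap = 10e12          # rough S&P 500 market cap, dollars
us_assets = 200e12         # rough total US capital assets, dollars

full_swing = futures_move / prob_shift           # implied move for a 100% shift
print(round(full_swing, 2))                      # 0.12, i.e. 12%
print(round(full_swing * sp500_cap / 1e12, 1))   # ~1.2 trillion dollars
print(round(full_swing * us_assets / 1e12, 1))   # ~24.0 trillion, all US assets
```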

Further, nearly all capital assets (stocks, bonds, commodities, real estate) in the US are now highly correlated. So the total difference is about $24 trillion (assuming total assets in the US are $200 trillion). Ironically, this probably means Donald Trump would be financially better off if he were to lose the election.


The other article that caught my eye involves Google Trend data. According to the Washington Post, the phrase “registrarse para votar” was the third highest trending search term the day after the debate was broadcast. The number of searches is about four times higher than in the days prior to the debates (see chart 2). Notice the spike in searches matches a spike in Sep 2012 after the first Obama-Romney debate.

The article says that it is not clear if it was the debate itself that caused the increase or the fact that Google recently introduced Spanish-language voting guides to its automated Knowledge Box, which presumably led to more searches for “registrarse para votar”. (This is the problem with confounding events.)

After a bit of research, I discovered an even more interesting fact. The spike in searches did not stop on Sep 27. Today, on Sep 30, four days after the debates, the volume of searches is 10 times higher than on Sep 27, or a total of 40x higher than before the debate (see chart 3). The two charts are scaled to make the data comparable.


Chart 2. Searches for “registrarse para votar” past 5 years to Sep 27. Courtesy of Washington Post and Google Trends


Chart 3. Searches for “registrarse para votar” past 5 years to Sep 30. Courtesy of Google Trends

I wanted to see if the spike was due to the debate or due to the addition of Spanish voter information to the Knowledge Box. To do this, I compared “registrarse para votar” to “register to vote”. The red line in chart 4 shows Google Trend data for “register to vote” scaled so that the bump in Sept 2012 is the same height as in the charts above. I’d say the debate really had an unprecedented effect on interest in voting and the effect was probably bigger for Spanish speaking web users.


Chart 4. Searches for “register to vote” past 5 years to Sep 30. Courtesy of Google Trends

Finally, I wanted to see how the search requests were distributed geographically. The key here is that most Hispanic communities vote Democratic and many states with a large Hispanic population are already blue (such as California, Washington, New Mexico, New Jersey, and New York). The exception is Florida with a large population of Cuban immigrants who tend to vote Republican.


Chart 5. Searches for “registrarse para votar” past 5 years to Sep 30 by county. Courtesy of Google Trends

If you are a supporter of Democrats like Mrs. Clinton, the good news is that a large number of queries are coming from Arizona and Texas, two states where changes in demographics are slowly turning voting preferences from red to blue.

In Florida, it is not clear which candidate gains from increased number of Spanish-speaking voters. However, since the increase is a result of the debate (during which it was revealed that Mr. Trump had insulted and berated a beauty pageant winner from Venezuela, calling her “miss housekeeping”), I will speculate many newly registered voters are going to be Clinton supporters.

If the Google search trend continues, it may be driven by news reports that Mr. Trump may have violated the US sanctions forbidding business transactions in Cuba. Cuban-Americans searching for information on voter registration after hearing this story are more likely to favor Mrs. Clinton.

by George Taniwaki

In a Dec 2011 blog post, I critiqued an article in The Fiscal Times that compared the cost of eating a meal at home against dining out at a restaurant. The article purported to show that eating at a restaurant was cheaper. I pointed out the errors in the analysis.

One of the errors was in the way data for expenditures on groceries and restaurants were shown in a line graph. The two lines were at different scales and aligned to different baselines, making comparisons difficult. The original and corrected charts are shown below. Correcting the baseline makes it clear that restaurant expenditures are significantly lower than grocery expenditures. Correcting the scale shows that restaurant expenditures are not significantly more volatile than grocery expenditures.


Figures 1a and 1b. Original chart (left) and corrected version (right)

Another error in the article I pointed out was that the lower inflation rate of meals at restaurants compared to meals at home should not favor eating more meals at restaurants. I didn’t give an explanation why. I will do so here.

Consider an office worker who needs to decide today whether to make a sandwich for lunch or to buy a hamburger at a restaurant. Let’s say she knows that the price of bread and lunch meat has doubled over the past year (100% inflation rate) while the cost of a hamburger has not changed (0% inflation rate). Which should she buy?

The answer is, she doesn’t have enough information to decide. The inflation rate over the past year is irrelevant to her decision today, or at least it should be. What is relevant are the actual costs and utilities today.

Let’s say she likes shopping, making sandwiches, and cleaning up, so the opportunity cost for the sandwich option is zero. Let’s also say she values sandwiches and hamburgers equally and doesn’t value variety. Now, if the price today of lunch meat and bread for a single sandwich is 50 cents while a hamburger is 75 cents, then she should make a sandwich. Next year, if inflation continues as before, making a sandwich will cost $1.00 while a hamburger remains 75 cents. In that case, she should buy a hamburger. But that decision is in the future.

Let’s consider an extreme case where inflation rates may affect purchase decisions today. What if the price of sandwich fixings is 50 cents today but inflation is expected to be 100% during the work week (so prices will be $.57, $.66, $.76, and $.87 over the next four days)? Such high inflation rates are called hyperinflation and can lead to severe economic distortions.
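Those daily prices follow from compounding: 100% inflation over a five-day work week means a daily factor of 2^(1/5), about 14.9% per day.

```python
# Checking the daily prices in the text: 100% inflation over a five-day
# work week compounds at 2**(1/5) per day, starting from 50 cents.

start = 0.50
daily_factor = 2 ** (1 / 5)   # ~1.1487; five of these double the price

prices = [round(start * daily_factor ** d, 2) for d in range(1, 5)]
print(prices)  # [0.57, 0.66, 0.76, 0.87], matching the text
```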

Let’s also assume hamburgers are 75 cents today and will remain fixed at that price by law. (Arbitrary but stringent price controls are another common feature of an economy experiencing hyperinflation.) Further, let’s assume that sandwich fixings can be stored in the refrigerator for a week for future use but hamburgers cannot be bought and stored for future consumption.

Finally, let’s assume it is early Monday and our office worker has no sandwich fixings or hamburgers but has $5 available for lunches for the upcoming week. What should she buy each day?

I would recommend buying $2.50 in sandwich fixings today (enough for five sandwiches at today’s 50-cent price). Here’s why. During a period of hyperinflation, you want to get rid of money as fast as possible because cash loses its value every day you hold it. Thus, buying as much food as possible today is a good investment (called a price hedge).

Ah, you say. Why not make sandwiches the first two days of the week and then switch to the relatively cheaper hamburgers for the last three days? That is unlikely to work because the restaurant is caught between paying rising prices for the food it buys while getting a fixed price for what it sells. Long lines will form as customers seek cheap food. The restaurant will either run out of food, go bankrupt, or close its doors. Regardless, our office worker shouldn’t rely on her ability to buy cheap hamburgers later in the week.

So why am I updating a blog post from almost two years ago? Well, it’s because I noticed a big spike in traffic landing on it last week. It turns out my wife, Susan Wolcott, assigned it as a reading for a class she is teaching to undergraduate business students at Aalto University School of Business, in Mikkeli, Finland. (The school was formerly known as the Helsinki School of Economics.)

Normally, this blog receives about 30 page views a day. On days that I post an entry on kidney disease or organ donation (pet topics of mine) traffic goes up. Of a typical day’s 30 hits, I presume about half of that traffic is not human. It is from web crawlers looking for sites to send spam to (I receive about 15 spam comments a day on this blog).

But check out the big spike in page views for my blog on the day my wife assigned the reading. This blog received 264 page views from 91 unique visitors. That’s the kind of traffic social media experts die for. Maybe I’ve hit upon an idea for generating lots of traffic to a website: convince college professors to assign it as required reading for a class.


Figure 2. Web traffic statistics for this blog

Naturally, I expect another big spike of traffic again today when my wife tells her students about this new blog post.

by George Taniwaki

I recently came across a fascinating article with the novel claim that life is older than the earth. The argument is based on a regression analysis that estimates the rate of increase in the maximum genomic complexity of living organisms over time. (Note, this argument only makes sense if you believe that the universe can be more than 5,000 years old, that genes mutate randomly and at a constant rate, that genetic changes are inherited and accumulate, and that mathematics can be used to explain how things work. Each of those assumptions can be argued separately, but that is beyond the scope of this blog post.)

The article, entitled Life Before Earth, was posted online in Mar 2013. The authors are Alexei Sharov, a staff scientist at the National Institute on Aging, and Richard Gordon, a theoretical biologist at the Gulf Specimen Marine Laboratory.

They draw a log-linear plot of the number of base pairs in a selection of terrestrial organisms against the time when the organisms first appeared on earth (see Fig 1). For instance, the simplest bacteria, called prokaryotes, have between 500,000 and 6 million base pairs in their genomes and are believed to have first appeared on earth about 3.5 billion years ago. At the other extreme, mammals, including humans, have between about 2.7 billion and 3.2 billion base pairs in their genomes. The fossil record indicates the first mammals appeared about 225 million years ago, during the Triassic period. All other known organisms can be plotted on these axes, and the trend appears linear, meaning the growth in genome complexity is nearly exponential.

Extrapolating the data back in time, one can estimate when the maximum complexity was only one base pair. That is, the date when the first protoörganisms formed. The trend line indicates this occurred 9.7 billion years ago, or about 4 billion years after the big bang.
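The extrapolation is easy to reproduce roughly. The two anchor points below are taken from the figures quoted above (prokaryotes at ~2 million base pairs, ~3.5 billion years ago; mammals at ~3 billion base pairs, ~225 million years ago). They are illustrative rather than the authors’ full dataset, but a straight-line fit in log space lands close to the paper’s 9.7 billion years.

```python
# Two-point log-linear extrapolation back to a one-base-pair "genome".
import math

points = [(3.5, math.log10(2e6)),    # prokaryotes: time in Gya, log10(base pairs)
          (0.225, math.log10(3e9))]  # mammals

# Fit log10(bp) as a linear function of time, then solve for log10(bp) = 0.
(t1, y1), (t2, y2) = points
slope = (y2 - y1) / (t2 - t1)
origin = t1 - y1 / slope      # time when the fitted line crosses bp = 1

print(round(origin, 1))  # ~10 Gya, in the ballpark of the paper's 9.7
```

The result is sensitive to which organisms you anchor on, which is one reason to treat the 9.7-billion-year figure as an order-of-magnitude estimate rather than a date.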


Figure 1. The growth in maximum base pair count per genome seems to grow exponentially over time. Image from the article

The earth is estimated to be only 4.5 billion years old. Thus, if these results are accepted, the implications are pretty astounding.

1. Life did not start on earth. It started somewhere else in the galaxy and reached the microörganism stage. Through some cosmic explosion, the life was transported here. Alternatively, life started on one of the planets around the star that went supernova before collapsing and forming our present-day sun. This hypothesis is called exogenesis

2. It is unlikely that these alien microörganisms only landed on earth as our solar system formed. They probably coated every asteroid, comet, planet, and moon in our solar system. They may still be alive in many locations and are either evolving or dormant

3. If all of the microörganisms that reached our solar system came from the same source, they likely have the same genetic structure. That is, if we find life elsewhere in our solar system, it is likely to contain a right-handed double helix of DNA built from the same four nucleotide bases as life on earth. With effort, we could construct an evolutionary tree that contains these organisms

4. In fact, the same microörganisms may be very common throughout the galaxy, meaning life has arrived on many other planets, or perhaps every planet in our galaxy, even ones with no star, a hypothesis called panspermia

5. It solves Fermi’s paradox. In 1950, Enrico Fermi noted that our Sun is a young star and that there are billions of stars in our galaxy that are billions of years older. Why haven’t we already been contacted by intelligent life from other planets? The answer based on this analysis is that intelligent life takes 9.7 billion years to form, and we may be among the first organisms to reach the level of intelligence necessary to achieve intentional interstellar communication and travel. Ironically, if Sharov and Gordon are right, unintentional interstellar travel is already quite common and has been for billions of years.

Relationship to Moore’s Law

If the exponential growth of complexity shown in Figure 1 above looks familiar, it is because it is the same shape as the increase in the number of transistors in microprocessor chips over time, a relationship called Moore’s Law. The authors cite this analogy as a possible argument in favor of their case.


Figure 2. The growth in maximum transistor count per processor has grown exponentially over time. Image from Wikipedia

Is this reasonable?

As I said, this is a fascinating article. But it is all speculation. We have no direct evidence for many of the claims and inferences made in this paper. Specifically, we don’t know:

  1. The exact size of the genomes that various organisms had in the earth’s past
  2. The nature of possible organisms less complex than prokaryotes
  3. The existence of any alien microörganisms or evidence that any landed on early earth
  4. The speed of genetic complexity change in the early earth environment, or in other planetary environments in the case of alien microörganisms prior to their arrival on earth
  5. Whether any modern earth organisms, or any potential alien microörganisms, could withstand the rigors of travel through space for the millions of years it would take to get to earth from another star system

Finally, we have no clear explanation why the rate of change in genome complexity should be exponential. The use of the Moore’s Law chart to show that exponential growth in complexity is reasonable is slightly disingenuous. Moore’s Law is generally used as a forecast for the future growth in complexity for a commercial product based on historical data. Further, the forecast is used to estimate product demand, research costs, and necessary production investment, all of which tend to drive advancements and make the prediction self-fulfilling.

On the other hand, genome complexity is not directed. Evolution is a random process that will generate greater complexity only if a new, more complex organism can take advantage of an ecological niche that cannot be exploited by simpler organisms. Nothing is driving greater genome complexity.

Anyway, this is a very controversial theory. But I believe it may lead to new insights regarding microbiology, astrobiology, the molecular clock hypothesis, and the use of mathematical models in evolutionary biology.


How long is 13.7 billion years?

As a side note, sometimes we wonder, how could something as complex as DNA and cellular organisms have started from nothing? It seems impossible to comprehend. But if you take a look at Figure 1, you will see that it may have taken over six billion years for a single base pair of DNA to grow in complexity to form an organism that we think of as a simple, primitive prokaryote. Then it took another 3.5 billion years before mammals appeared. Then it only took 200 million more years before our human ancestors appeared. And finally only a few hundred thousand years passed before you and I were born.
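The video’s compression ratio, one billion years per minute, makes those spans easy to compare. A quick sketch of the arithmetic, using the rough durations from the paragraph above (the 300,000-year figure is my stand-in for “a few hundred thousand years”):

```python
# One billion years of real time is compressed into one minute of video.
SECONDS_PER_BILLION_YEARS = 60.0

def compressed_seconds(years):
    """Real duration in years -> screen time in seconds."""
    return years / 1e9 * SECONDS_PER_BILLION_YEARS

print(compressed_seconds(13.7e9))  # whole video: 822 seconds, i.e. 13.7 min
print(compressed_seconds(3.5e9))   # prokaryotes to mammals: 210 seconds
print(compressed_seconds(200e6))   # mammals to human ancestors: 12 seconds
print(compressed_seconds(300e3))   # human ancestors to us: ~0.02 seconds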

To give you a feel for how long 13.7 billion years is, watch this extremely boring YouTube video that compresses each billion years into one minute.


Figure 3. Age of Universe, boring, even with music. Video still from Craig Hall


A final thought to close this blog post: there may be aliens living on earth, but don’t be afraid, because it’s us.

by George Taniwaki

A Dec 2011 article in The Fiscal Times purports to show that eating at restaurants is cheaper than cooking at home. It’s an intriguing idea that has appeared in many articles in the past. However, the analysis presented in The Fiscal Times article is flawed and the conclusions are not supportable.

Before going into the specifics of the errors in The Fiscal Times article, let’s consider how one could compare whether cooking at home is more expensive than eating at restaurants. The typical cost-benefit analysis for eating in versus dining out goes something like as follows. Cooking a meal at home isn’t free. From a classical economic point of view, one should include the opportunity cost of the time needed to buy groceries, drive them home, store them, prepare a meal, and clean up afterwards. Further, one should include the implicit rental value of the automobile used to transport the groceries and of the kitchen and dining room used to prepare and serve the meal.

However, shopping, cooking, and cleaning are not just chores that one is required to do. They are a form of entertainment, social interaction, and a way to share your skills with others as Nathan Myhrvold insightfully states in this Dec 2011 Slate interview. The cook receives utility from hosting a meal, even if it is a regular daily event. Naturally, if one hates to shop, cook, or clean, then there can be disutility as well. When deciding whether to eat at home or dine out, a person will want to maximize the expected utility from the decision.

Examining the wrong factors

The Fiscal Times article briefly mentions some of the above factors, but then totally ignores them when doing the price comparisons. Instead, it mentions differing inflation rates between in-home meals and restaurant meals. Relative inflation rates should be irrelevant to the decision to eat at home or dine out. The author also throws in a few additional factors that seem equally irrelevant to comparing costs:

“We also didn’t factor in whether one meal or another would be healthier, or friendlier to the environment. But that’s part of the point: Eating right and finding the extra savings that could be had by comparison shopping comes with a time trade-off that many families can’t afford to make these days.”

Hard-to-interpret charts

The Fiscal Times article has two time-series charts, which I reproduce below. Some of the problems I found in the first chart:

  1. For some reason the first chart is labeled Chart 2 and the second is labeled Chart 1.
  2. Chart 2 (the first chart) has two different scales (the left scale spans 1.4% while the right scale spans only 0.4%) even though both display values from the same dataset (percent share of consumption). This means the data plotted against the right scale will appear about 3.5 times more variable.
  3. The black arrows both point toward the right scale, though “Grocers” (what’s with the quotation marks?) is said to be set to the left-hand scale (LHS).
  4. Neither scale shows the 0% origin point or the 100% end point. (Note that if the scales did go from 0 to 100%, there would be no need for two different scales.)
  5. Assuming the left scale applies to “Grocers” and the right scale to “Restaurants”, then the “Grocers” share is always above the “Restaurants” share. The two lines should never cross, contrary to what the chart shows.
  6. There is no source attribution for the data, so there is no way to judge how valid it is or to review the original data.
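The dual-scale problem in item 2 is worth quantifying. When two series share units but not axes, an identical real change fills a different fraction of each axis, so one series looks artificially volatile. A sketch using the axis spans stated above (the 0.2-point swing is an arbitrary example value):

```python
# Two axes in the same units (percent share) but with different spans.
LEFT_SPAN = 1.4    # percentage points covered by the left scale
RIGHT_SPAN = 0.4   # percentage points covered by the right scale

swing = 0.2  # an identical real change in share, in percentage points

# Fraction of each axis's height that the same swing occupies
left_visual = swing / LEFT_SPAN
right_visual = swing / RIGHT_SPAN

distortion = right_visual / left_visual  # equals LEFT_SPAN / RIGHT_SPAN
print(round(distortion, 1))  # 3.5: same change looks 3.5x bigger on the right
```

Note the swing cancels out: the visual distortion is just the ratio of the two axis spans, so every movement in the right-scale series is exaggerated by the same factor.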


The second chart (entitled Chart 1) also has several flaws.

  1. The color coding has been reversed. Dining-out share was shown as the blue line in the first chart, while its inflation is shown in gold. Similarly, eat-at-home share was shown in gold in the first chart, while its inflation is in blue.
  2. The label for each line has changed. “Restaurants” in the first chart is now called Food away from home while “Grocers” is now Food at home
  3. The use of different labels makes one wonder if the same assumptions, data sets, and cost allocations are used in the two charts and whether the same analysts produced both charts. My guess is no, which means the two charts cannot be used together
  4. As mentioned above, relative inflation rate should not directly impact the consumer’s choice to eat at home or at a restaurant, so this chart isn’t very useful


Nonequivalent price comparisons

The Fiscal Times article includes a slideshow that compares the cost of selected meals at restaurants with the cost of preparing the meal at home. In five out of six cases, the restaurant meal is cheaper.

If you compare only the price of store-bought food to the price of a cooked meal at a restaurant, there is probably no way the prices of food ingredients in a competitively priced retail store could exceed the price in a non-subsidized restaurant. Certain restaurants can serve meals at lower than expected prices because of subsidized food (school lunch programs), volunteer labor (homeless shelters or church meal programs), or subsidized rent (canteen stores or cafeterias in office buildings).

So how did The Fiscal Times get these unlikely results? I think the following errors were made:

  1. The restaurant meal prices are for a single serving while the grocery store prices are for full cans, boxes, or other packages. These will provide much more food than the restaurant meal.
  2. The grocery store prices are from FreshDirect, a grocery delivery service in New York. Delivered groceries are more expensive than self-serve, and New York is among the most expensive cities in the U.S.
  3. The restaurant meal prices exclude the tip
  4. The grocery store prices include some prepared deli foods. Grocery store deli food can be more expensive than restaurant food since it is an impulse buy
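The first error is easy to correct: normalize the grocery price to a per-serving cost before comparing. A sketch with made-up numbers (the prices and serving count below are hypothetical, not taken from the article):

```python
# Hypothetical prices chosen only to illustrate the normalization.
package_price = 4.50        # grocery price for a multi-serving package
servings_per_package = 6    # servings the package actually yields
restaurant_price = 3.00     # single-serving restaurant menu price

# Compare like with like: cost per serving, not cost per package.
home_per_serving = package_price / servings_per_package  # 0.75
print(home_per_serving < restaurant_price)  # True once normalized
```

Comparing the raw package price (4.50) to the plate price (3.00) would make the restaurant look cheaper, which is exactly the apples-to-oranges comparison the slideshow makes.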

*An orangery is a British term for a type of greenhouse. Orangeries were mostly used to grow citrus fruit (hence the name). Now most citrus fruit in Great Britain is imported. One of the most famous orangeries is located in the Royal Botanic Gardens in London. It is now used as a restaurant where people dine out, not a greenhouse used to grow oranges that people eat at home.

[Update: Expanded the footnote to make clear the irony of an orangery being used as a restaurant.

For a clearer explanation why differing inflation rates should not affect the choice between eating in and going to a restaurant, see this Sep 2013 blog post.]