Statistics | Real Numeracy

December 31, 2021

Happy birthday to you

Every day is somebody’s birthday. Image from wishbirthday.org

by George Taniwaki

I have a Facebook friend whose birthdate was listed as January 1. When I saw this, I thought, “That can’t be true. Nobody is born on January 1.” But later, I realized I was wrong. People are born on every day of the year. So just how likely is it that my friend’s birthdate is January 1?

Distribution of actual birthdates

If births were distributed evenly over time, you might expect every calendar date to have the same number of births. But because of leap years, February 29 comes around less often than the other dates. Thus, every date except February 29 has a probability of 1 out of 365.242, or about 0.00274. A birthdate of February 29 has a probability of 0.242 out of 365.242, or about 0.00066.
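The arithmetic above can be reproduced in a few lines. This is a minimal sketch that assumes a mean year length of 365.242 days and perfectly even births; the variable names are mine.

```python
# Probability of each birthdate if births were spread evenly over time.
# Assumes a mean year length of 365.242 days, so February 29 occurs
# in only about 0.242 of all years.
DAYS_PER_YEAR = 365.242
LEAP_DAY_WEIGHT = 0.242

p_ordinary_day = 1 / DAYS_PER_YEAR
p_leap_day = LEAP_DAY_WEIGHT / DAYS_PER_YEAR

# The 365 ordinary dates plus February 29 should account for every birth.
total = 365 * p_ordinary_day + p_leap_day

print(f"ordinary day: {p_ordinary_day:.5f}")
print(f"February 29:  {p_leap_day:.5f}")
print(f"total:        {total:.5f}")
```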

Birthdates are not evenly distributed; there is some seasonality. More children are born in the summer than in the winter (Wikipedia). Separately, studies show that childhood mortality may depend on birthdate. For example, death after preterm birth may be higher for babies born in the summer (Epidemiology, Sep 2009) and juvenile skin cancer deaths may be higher for babies born in the spring (Brit J Cancer, Oct 2014).

Further, delivery date can be strongly influenced by the desires of both the mother and the caregiver. For instance, an obstetrician may not want to deliver a baby or make post-delivery rounds on weekends. If so, they may induce labor early in the work week. In the U.S., more babies are born on Monday through Wednesday than on Thursday through Sunday (Wikipedia). This has no effect on the day-of-year distribution, though, because weekends can fall on any date. However, the desire to avoid births on non-work days can affect holiday births. Many holidays are observed on Mondays, so their effect is spread across several dates from year to year. However, some holidays are still observed on a specific date, including New Year’s Day, U.S. Independence Day, and Christmas. One would expect fewer births on, or in the days leading up to, January 1.

A distribution of birthdates from 480,040 life insurance policy application forms submitted between 1981 and 1994 is available at https://www.panix.com/~murphy/bdata.txt. The counts and probabilities are plotted in Figure 1 below.

Figure 1. Distribution of birthdates from life insurance policy applications

The dip on February 29 is expected. There is also a dip from December 22 through December 26, likely days when both pregnant patients and caregivers want to avoid spending time in a hospital.

There are 1482 birthdates listed as January 1 (p = 0.00308), which is significantly higher than expected. It is also significantly higher than the number of births on December 31 (0.00281) or January 2 (0.00252). It appears that both pregnant patients and caregivers like to deliver babies on New Year’s Day. However, given a choice between a December 31 and a January 1 birth, choosing January 1 is a poor tax planning strategy (IRS).
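One rough way to check "significantly higher than expected" is a binomial z-score. This sketch treats each of the 480,040 applications as an independent draw and uses the even-distribution probability of 1/365.242 as the null hypothesis. That choice of test is my assumption, not necessarily the one behind the statement above.

```python
import math

N_APPLICATIONS = 480_040      # policy applications in the dataset
OBSERVED_JAN_1 = 1_482        # birthdates listed as January 1
P_EXPECTED = 1 / 365.242      # even-distribution probability for one date

# Mean and standard deviation of a binomial count under the null hypothesis.
expected = N_APPLICATIONS * P_EXPECTED
std_dev = math.sqrt(N_APPLICATIONS * P_EXPECTED * (1 - P_EXPECTED))

# How many standard deviations the observed count sits above the mean.
z_score = (OBSERVED_JAN_1 - expected) / std_dev

print(f"expected count: {expected:.0f}")
print(f"z-score:        {z_score:.1f}")
```

Under these assumptions the observed count is more than four standard deviations above the expected count, which is consistent with calling the difference significant.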

Anyway, it isn’t that unlikely that my friend’s birthdate is actually January 1.

Here we assume that the life insurance birthdate data is accurate and truthful. That’s because people do not have much incentive to lie about their birthdate, especially since a false statement on an application can cause the company to reject a claim.

Distribution of stated birthdates

Before we conclude that my friend was born on January 1, there is a second issue. That is, why did I think that my friend’s birthdate is not January 1? Shouldn’t I believe everything people post on Facebook (Independent, Mar 2015, PLOS One Feb 2015)?

My friend’s birthdate may not be January 1 even if they say it is. Perhaps they don’t know when their actual birthdate is (they were adopted and have never seen their birth certificate) and list it as January 1 because it seems like a good default date to use. Maybe they consider their birthdate to be private information and use January 1 as their publicly stated birthdate. Perhaps they think their real birthdate is uninteresting and use January 1 to make it more interesting. Finally, maybe they just like to lie.

Table 1 below shows all the combinations of actual birthdates (rows) and self-reported birthdates (columns).

Table 1. Probability of all combinations of self-reported and actual birthdates

The cells are color coded. The legend is shown in Figure 2. The yellow cell in the upper left corner is the probability that a person is telling the truth; they say their birthdate is January 1 and it actually is. The green shaded cells represent people who lie. They say their birthdate is January 1, but it is not. The blue shaded cells also represent people who lie. They are born on January 1 but state their birthdate is a different date. The remaining yellow shaded cells and the white cells represent all the other combinations that do not involve January 1.

Figure 2. Legend for Table 1

The sum of all the probabilities in the first row (yellow shaded and blue shaded cells) will give the probability that the actual birthdate is January 1. The sum for any row gives the probability the actual birthdate is the i^th day of the year. We can estimate this probability using the life insurance data listed above.

The sum of all the probabilities in the first column (yellow shaded and green shaded cells) will give the probability that the self-reported birthdate is January 1. The sum for any column gives the probability a person will say their birthdate is the j^th day of the year. Facebook has this data for Facebook users, but I was not able to find any reference to it.

Measuring the nonsampling error

To find the probability that my friend is telling the truth, we need to know the ratio of truth telling to lying about birthdate for everyone who claims to be born on January 1. Unfortunately, we cannot estimate it by simply combining the life insurance birthdate data with the Facebook birthdate data. The life insurance dataset and the Facebook dataset each contain 366 values. The probability matrix has 366 x 366 = 133,956 cells. Even if we make some simplifying assumptions about truth-telling vs lying, there are too many degrees of freedom in the matrix to fill it.
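In terms of Table 1, the number we want is the yellow corner cell divided by the sum of the first column. The sketch below shows how sensitive that ratio is under a toy one-parameter model. The model is purely hypothetical: it assumes a fraction `default_share` of people state January 1 no matter when they were born and everyone else states their actual birthdate. The values of `default_share` are invented, not estimates.

```python
P_ACTUAL_JAN_1 = 1_482 / 480_040   # row sum for January 1, from the life insurance data


def prob_truthful_given_stated_jan_1(default_share: float) -> float:
    """P(actual = Jan 1 | stated = Jan 1) under a toy one-parameter model.

    Hypothetical assumption: a fraction `default_share` of people state
    January 1 no matter when they were born; everyone else states their
    actual birthdate.
    """
    # Corner cell: born on Jan 1 and stating Jan 1 (truthful people plus
    # default-users who happen to be born that day).
    corner_cell = P_ACTUAL_JAN_1
    # First-column sum: everyone who states Jan 1.
    column_sum = (1 - default_share) * P_ACTUAL_JAN_1 + default_share
    return corner_cell / column_sum


for share in (0.0, 0.001, 0.01, 0.05):   # made-up values, not estimates
    print(f"default share {share:.3f}: {prob_truthful_given_stated_jan_1(share):.2f}")
```

In this toy model, a default share of just 1% drops the probability that a stated January 1 birthdate is real to roughly one in four. Without the column data there is no way to pin the share down.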

Conclusion

We cannot tell if my friend is telling the truth or not and do not know if their birthdate is January 1 or not. So happy birthday!

[Correction1: Changed “women” to “patients”

Correction2: Clarified that some holidays are on Monday and others on a specific date]

May 12, 2020

Happy birthday Florence Nightingale

Posted by gtaniwaki under Health care, Real Numeracy | Tags: Covid-19, History, Statistics

Example of polar area chart showing causes of mortality among soldiers by month during Crimean war. Image from Wikimedia

by George Taniwaki

May 12 is International Nurses Day, which recognizes the contribution nurses make and celebrates the birth of Florence Nightingale. Today marks the 200th anniversary of her birth. The World Health Organization named this year the Year of the Nurse and the Midwife in her honor. Certainly, with the Covid-19 pandemic in full force, 2020 will be remembered as the Year of the Nurse for many years to come.

Ms Nightingale, who was born in Florence, Italy, was the founder of the modern nursing profession. Prior to her efforts, nursing was a volunteer activity, most often undertaken by untrained family members, soldiers, or members of religious orders. Ms Nightingale trained nurses during the Crimean War. She later founded the first secular nursing school and published many nursing textbooks.

In addition to advancing nursing in a clinical setting, Ms Nightingale was a social activist who advocated for more government spending on healthcare for the poor. She helped develop the field of public health nursing to reach patients who were poor and sick at home.

Finally, Ms Nightingale was an incredible statistician and a pioneer in data visualization. She kept thorough notes and documented which treatments worked and which did not, making it possible for others to replicate her results. She popularized a type of pie chart that she called a coxcomb (see image above), which is now known as a polar area chart. She was the first woman elected to the Royal Statistical Society and became an honorary member of the American Statistical Association.

May 4, 2020

Tracking the growth of Covid-19 redux

Posted by gtaniwaki under Health care, Real Numeracy | Tags: Covid-19, Maps, Statistics

Yesterday’s forecast (left) and today’s (right). Images from IHME

by George Taniwaki

The Covid-19 story moves very fast. Yesterday, I posted a blog entry with a chart showing that the Institute for Health Metrics and Evaluation (IHME) forecast 72,000 deaths in the U.S. by June 2020 with almost no new deaths between then and August 2020. The IHME forecast assumed that stay-at-home orders would remain in place until August (see chart above left).

Today, three interesting pieces of news were reported. First, the IHME abandoned its assumption that the population will stay at home and instead switched to using smartphone location data provided by mobile carriers to estimate population mobility. This boosted their estimate of deaths in August to 134,000, an increase of 62,000 (see chart above right).

Second, a group of data scientists led by the University of Sydney’s Centre for Translational Data Science has reviewed the forecasts by IHME and found that they underestimate the uncertainty associated with Covid-19 deaths. For 70% of the state-level forecasts, the actual number of deaths fell outside the 95% prediction interval (Arxiv May 2020). If the intervals were well calibrated, you would expect only about 5% of outcomes to fall outside them.
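Checking the calibration of a prediction interval is simple in principle: count how often the observed value lands outside the interval. Here is a minimal sketch. The `IntervalForecast` type and the numbers are invented for illustration and are not IHME or state data.

```python
from typing import NamedTuple


class IntervalForecast(NamedTuple):
    lower: float    # lower bound of the 95% prediction interval
    upper: float    # upper bound of the 95% prediction interval
    actual: float   # value observed afterwards


def share_outside(forecasts: list[IntervalForecast]) -> float:
    """Fraction of observed values that fall outside their interval."""
    misses = sum(1 for f in forecasts if not (f.lower <= f.actual <= f.upper))
    return misses / len(forecasts)


# Invented numbers for illustration only; not IHME or state data.
example = [
    IntervalForecast(lower=10, upper=20, actual=15),   # inside
    IntervalForecast(lower=5, upper=9, actual=12),     # outside (above)
    IntervalForecast(lower=30, upper=50, actual=28),   # outside (below)
    IntervalForecast(lower=0, upper=4, actual=4),      # inside (on the bound)
]

print(f"{share_outside(example):.0%} of outcomes fell outside the interval")
```

For a well-calibrated 95% interval, this share should hover around 5% over many forecasts.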

Finally, in yesterday’s blog post I discussed the ensemble forecast of Covid-19 deaths created by the Centers for Disease Control and Prevention (CDC). Until last week, the IHME forecast was included in the ensemble. On Friday, the CDC dropped the IHME forecast from its ensemble and replaced it with forecasts from Imperial College.

The IHME forecast was lower than most other forecasts and had been a favorite of the Trump administration (Politico Apr 2020) and of the CDC (Medium Apr 2020).

Last week’s CDC ensemble forecast (left) included the IHME forecast; this week’s (right) does not. Image from CDC

May 3, 2020

Tracking the growth of Covid-19

Posted by gtaniwaki under Health care, Real Numeracy | Tags: Covid-19, Maps, Sampling bias, Software, Statistics

Animated Covid-19 map, screenshot from Domo

by George Taniwaki

In order to make predictions about the future trajectory of the spread of Covid-19, you need to be able to make sense of the currently available data. There are several steps to get good data.

Medical event data

First, you have to be able to collect data from multiple sources, clean them, and aggregate them based on standard criteria. Each data record could include the following elements:

Event (what was counted, e.g., tests administered, positive test results, negative results, hospital admissions, ICU status, ventilation status, discharges, recoveries, deaths, etc.)
Location ID (where the event occurred, see below)
Date of incidence (when the event occurred)
Date of reporting (sometimes data is reported days or even months after the event and can be updated many times as errors are corrected or missing data is estimated)
Value (a count)
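The record layout above could be sketched as follows. The class, field, and function names are hypothetical, as are the sample records. Because counts are often revised after the fact, the helper keeps only the most recently reported value for each event, location, and incidence date.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class EventRecord:
    event: str             # e.g. "positive_test", "death"
    location_id: str       # key into the location table
    incidence_date: date   # when the event occurred
    report_date: date      # when this count was reported or revised
    value: int             # the count


def latest_counts(records: list[EventRecord]) -> dict[tuple[str, str, date], int]:
    """Keep only the most recently reported count for each
    (event, location, incidence date), since counts get revised."""
    latest: dict[tuple[str, str, date], EventRecord] = {}
    for rec in records:
        key = (rec.event, rec.location_id, rec.incidence_date)
        if key not in latest or rec.report_date > latest[key].report_date:
            latest[key] = rec
    return {key: rec.value for key, rec in latest.items()}


# Made-up records: the April 1 death count is revised a week later.
records = [
    EventRecord("death", "US-WA", date(2020, 4, 1), date(2020, 4, 2), 10),
    EventRecord("death", "US-WA", date(2020, 4, 1), date(2020, 4, 9), 12),
    EventRecord("death", "US-WA", date(2020, 4, 2), date(2020, 4, 3), 7),
]
print(latest_counts(records))
```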

The best repository of Covid-19 data is maintained by the New York Times (on GitHub) with an interactive viewer. Johns Hopkins University Coronavirus Resource Center also has a dataset. The best source for counts of tests in the U.S. is available from the Covid Tracking Project sponsored by the Atlantic.

One of several graphics available from the New York Times

Public policy change data

In addition to medical events, there are public policy events that can be tracked, such as government orders to close nonessential businesses, travel restrictions, and so forth. These records could include the following elements:

Event (what type of public policy change was made)
Location ID (where the change applies to, see below)
Date of incidence (when the change was implemented)
Date of reporting (when change was reported, usually before the change is implemented)

Unfortunately, I could not find a centralized source of information on government restrictions and the dates they became effective. A different source of information that can help indicate how much contact there is between people is the amount of movement by people who carry smartphones. Smartphones contain a GPS receiver and can report their position. The position can be used to indicate what type of activity the person is engaging in. Google Health has a community mobility report that is updated regularly. An example report is shown below and the data in .csv format is available for download.

Among those who own Android smartphones and participate in tracking, trips have declined. Screenshot from Google Health

Demographic and geographic data

To analyze the data, you will want to append demographic and geographic data about the locations. Unlike events, demographic and geographic data changes slowly, so it only needs to be collected once during the model building process. The following data elements could be useful to prepare a model or forecast:

Location ID (from above)
Name or description
Location hierarchy (continent > country > region > state > county > city > zip code, etc.)
Latitude and longitude of centroid
Latitude and longitude of center of largest city
Surface area (km²)
Total population
Age distribution
Gender distribution
Income distribution
Race distribution
Political party affiliation distribution
Health insurance coverage distribution
Comorbidity distribution (smoking, diabetes, etc.)
Number of hospitals
Number of hospital beds
Number of ICU beds
Number of ventilators

Some good sources for this type of data are US Census, United Nations Demographic Year Book, United Nations Development Programme’s (UNDP) Human Development Report and the World Bank’s World Development Report, Gapminder, and ESRI.

Visualize the data

Once the data is aggregated, there are many ways to visualize it. Maps are an obvious way to display location data. Line charts are an obvious way to display time series data. Domo, a developer of business intelligence software, has a very nice animation that displays time series data on a map (screenshot at top of blog).

Two caveats about their display. First, the number of cases is underreported because testing for infection was not widespread early in the pandemic, and is still too low today.

Second, outside the U.S. the data is reported by country, not by state or other smaller region. A single marker is used to represent the location of events. This is probably fine for Europe or Africa, where countries tend to be small. However, it is misleading for larger countries like Canada, Russia, China, Indonesia, Australia, and Brazil. Even data for a state like California is distorted because one would expect separate markers for the Bay Area and for the LA Basin instead of a single one in the middle of the state.

Johns Hopkins Center for Systems Science and Engineering has produced a nice dashboard hosted on ArcGIS (screenshot below). It does a better job of dividing large countries into smaller geographic partitions, but the colors are dark. A description of the project was published in Lancet Infect Dis (Feb 2020) and in a press release (Jan 2020). All of the data and the dashboard are available in a GitHub repository.

Another example of a Covid-19 map. Screenshot from ArcGIS

A note about line charts. You often see Covid-19 growth charts by country that display time (either calendar date, or days since the nth event occurred) on the horizontal axis and count on the vertical axis. Both are scaled linearly. I find these charts hard to interpret and compare. I think a better way to display growth data is to display data on the vertical axis using logarithm of counts per 100,000 population and on the horizontal axis using days since the n*(population/100,000)th event occurred. Even better would be to divide large countries into smaller regions so that all the charts covered regions with similar populations.
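The axis transformation described above could be sketched like this. The function name, the default of one event per 100,000 people, and the sample counts are my own assumptions for illustration.

```python
import math


def normalized_growth_curve(
    cumulative_counts: list[int], population: int, n_per_100k: float = 1.0
) -> list[tuple[int, float]]:
    """Return (days since threshold, log10 of count per 100,000 people).

    Day 0 is the first day the cumulative count reaches
    n_per_100k * (population / 100,000) events.
    """
    threshold = n_per_100k * population / 100_000
    curve = []
    day_zero = None
    for day, count in enumerate(cumulative_counts):
        if day_zero is None:
            if count < threshold:
                continue   # still before the starting threshold
            day_zero = day
        per_100k = count * 100_000 / population
        curve.append((day - day_zero, math.log10(per_100k)))
    return curve


# Invented daily cumulative counts for a region of 2 million people.
counts = [5, 12, 20, 45, 90, 200, 410]
for days, log_rate in normalized_growth_curve(counts, population=2_000_000):
    print(days, round(log_rate, 2))
```

Plotting the resulting pairs for several regions on the same axes puts them on a comparable footing regardless of population size.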

Making Forecasts

There are many groups making forecasts of Covid-19 infection rates and death rates. The CDC has a summary of them along with its own ensemble forecast. It predicts under 100,000 deaths in the U.S. at the end of May. The Institute for Health Metrics and Evaluation (IHME) predicts about 72,000 total deaths at the end of May but with a range from 60,000 to 115,000. You can download the data from the Global Health Data Exchange.

In addition to forecasting deaths, the IHME forecasts hospital utilization. These forecasts are used by hospitals to schedule resources and plan for peak usage.

Individual forecasts of cumulative reported deaths in U.S. from Covid-19 (left) and CDC ensemble forecast (right). Image from CDC

Cumulative death forecast in U.S. Image from IHME.

One of the best forecasts I have seen was produced by the Economist. It synthesizes data from US Census, New York Times, Covid Tracking Project, IHME, Google Health, and Unacast. The choropleth map of the U.S. below shows risk factors for Covid-19 mortality at the county level. Green shows areas where the risk level is low (less than 1%) and red shows areas where it is high (6% or above).

Dixie in the crosshairs. Image from Economist

* * * *

Update1: In just one day, the IHME forecast is obsolete. See my response at https://realnumeracy.wordpress.com/2020/05/04/tracking-the-growth-of-covid-19-redux/

Update2: Add link to New York Times dataset and interactive viewer

April 16, 2020

The Covid-19 tracking app won’t work

Posted by gtaniwaki under Health care, Real Numeracy | Tags: Access to healthcare, Covid-19, Game theory, Probability, Sampling bias, Statistics

Track this. Photo from Bloomberg BusinessWeek by Karen Ducey/Getty Images

by George Taniwaki

In a Bloomberg Businessweek editorial (Apr 2020), Cathy O’Neil (mathbabe) explains why a Covid-19 tracking app won’t work. It’s all about self-selection bias.

* * * *

Update: For a good non-technical description of how the Apple and Google contact tracing API works, including the encryption method, see Economist, Apr 2020. The article also suggests that even though using an app for contact tracing is imperfect, its low-cost and passive nature makes it worthwhile.

April 10, 2020

A partially effective vaccine may hurt us

Posted by gtaniwaki under Health care, Real Numeracy | Tags: Access to healthcare, Covid-19, Economics, Game theory, Probability, Statistics

Can a partially effective vaccine flatten the curve?

by George Taniwaki

During this Covid-19 pandemic, we want to know when we can stop sheltering at home and go back into public spaces again. Further, we want to know which actions can speed up the time before that can happen.

One thing we do know is that when dealing with a novel disease (one that no human appears to have immunity to), the entire population cannot go back to pre-epidemic behavior at the same time before it is safe. Doing so will cause a spike in infections and deaths. This will terrorize the population, leading to another round of isolation. If the public loses faith that the government knows when it is safe to change behavior, then when it finally is safe, people will still be afraid and time will be lost during the recovery, causing additional economic hardship.

So when can we go back to normal? I think that can happen only after herd immunity is achieved. This can take a very long time as a trickle of individuals become infected and recover with resistance or die, a process called flattening the curve. Or it can happen pretty quickly after the widespread inoculation of individuals with a safe and effective vaccine.

An effective vaccine may take 18 to 24 months to develop. Many people, including President Donald Trump, think staying home this long is unrealistic. Is it possible to shorten that time by releasing a partially effective vaccine sooner? Doing so may help flatten the curve without requiring social distancing.

Partially effective vaccines

An intriguing paper by Eduard Talamàs & Rakesh Vohra, entitled “Free and perfectly safe but only partially effective vaccines can harm everyone” pretty much contains the answer in its title.

The idea is that a partially effective vaccine will cause people to change their behavior too much, too soon, causing the spike we want to avoid. The conclusion is similar to the analysis popularized by Sam Peltzman of the University of Chicago (a microeconomics professor while I was a student there), who suggested that stricter automobile safety regulations could lead to increased deaths (of pedestrians) as drivers felt safer and became more reckless (J Polit Econ, Aug 1975).

The most important conclusion in Talamàs and Vohra is that, with overlapping social networks, even those who do not increase the size of their networks after the introduction of the vaccine can be harmed by those who do. This conclusion differs slightly from those of most epidemiological models, which assume random contact between individuals rather than strategic networks. A good description of the paper is given by one of the authors, Vohra, at The Leisure of the Theory Class (Apr 2020).

November 28, 2019

Google Maps vs Waze

Posted by gtaniwaki under Personal, Real Numeracy | Tags: Maps, Search engine, Statistics

Not how the crow flies to get to work each day

by George Taniwaki

I really don’t like to drive to work. I’ll do almost anything to avoid a long commute. For most of my adult life I have either walked or ridden a bus to get to work. Yes, it’s possible. I always chose an apartment to rent or house to buy based on how close it is to my job. And once I’ve found a place to live, I usually reject a new job unless it’s within walking or riding distance. It helps that I like living in big cities.

But I just started a contract assignment in Mountlake Terrace, a suburb of Seattle about 20 miles from where I live as the crow flies (see image above, or not). This isn’t the longest commute in my life, but the first long one in a few decades. And it’s the first long commute where I’m driving alone rather than in a carpool.

Google Maps, initial attempt

The traffic in Seattle is awful. To reduce my commute time, I’ve decided that starting my drive at 6:30 and working from 7:30 to 4:00 will help. Before my first day at the job site, I pull out Google Maps and plot my route (see Fig 1).

Figure 1. My first route to the office, average 42 minutes every morning

For the analysis in this blog post, I split my drive into segments based on type of driving. Segment A consists of surface streets from my house to the highway. B is 17 miles at highway speed driving north, away from the city, C is a slow slog where I double-back and join the commuters coming into town, and D is the final, short segment of surface streets to the office.

The table below shows details of my commute. Most of it is tolerable. But notice that segment C (the red zone) constitutes less than one-sixth of my commute distance but over one-third of my commute time.

Map	Description	Dist. (mi)	Speed (mph)	Elapsed Time (min)
A	Surface streets from home to SR520	3	30	6
B	SR520 to I-405 to I-5	17	60	17
C	I-5 to Mountlake Terrace	4	15	16
D	Surface streets from I-5 to office	1	20	3
	TOTAL	26	37	42

Google Maps, redux

After a couple weeks of following this route, I’ve learned which lanes to use on which segments to slice a few minutes off my commute. But I think I can still do better. I check Google Maps for some alternatives. This time it gives me a completely different, and unexpected, route. It tells me to go a few miles out of my way south to I-90 and drive north through the city on I-5 (see Fig 2).

I-5 in downtown Seattle is one of the most congested highways in the U.S. I get queasy every time I drive it worrying about getting stuck. But maybe it’s not so bad at 6:45 AM, which is about what time I will get there. So I trust Google Maps and try it.

Figure 2. Google Maps’ new suggestion, average time 44 minutes

It works. I try the route on two consecutive days. The table below shows the average results. There is congestion on I-5 between I-90 and Olive Way (Segment C red zone) but it is a shorter segment than on my previous commute. However, the route is longer, so it doesn’t save much time. Further, I don’t like this route because it limits my options. If there is an accident or other delay, I will be stuck in traffic with no easy way to avoid it.

Map	Description	Dist. (mi)	Speed (mph)	Elapsed Time (min)
A	Surface streets from home to I-90	4	30	8
B	I-90 to downtown	8	60	8
C	I-5 to Olive Way	2	20	6
D	I-5 to Mountlake Terrace	14	60	14
E	Surface streets from I-5 to office	1	20	3
	TOTAL	29	42	39

Waze to the rescue

Waze is a GPS navigation app that was originally developed in Israel and quickly went global. It combines traditional digital map data with real-time location data from users, including speed, route, reports of traffic jams, accidents, police speed traps, and gasoline prices at nearby stations. Thus, the more people who use it, the more accurate it becomes.

Waze also shows you the current toll price (Seattle uses variable toll pricing) and lets you avoid tolls, ferries, or highways, if desired, when choosing a route.

Google (now Alphabet) acquired Waze in 2013 but it remains a separate entity from Google Maps. Because Waze collects potentially personally identifiable information (PII), it has a less restrictive user agreement than Google Maps and warns users of that fact. (Though most people never read the agreement and just click “I accept”.)

Generating a route is a highly resource-intensive calculation that often involves machine learning. To simplify the work, Google Maps generally limits routes to major arterial streets. Waze combines those calculations with the actual routes users are taking to find the minimum travel time. Thus, Waze often creates routes that run through residential neighborhoods. Of course, the neighbors sometimes complain or even fight back by generating fake route data (Wash Post, Jun 2016).

Figure 3 below shows the route Waze recommends for my commute. It looks just like the original route that Google Maps suggested, except for the last segment. I still take the I-5 cloverleaf, but instead of continuing onto I-5, it has me veer right and use side streets to get to the office.

Figure 3. My new favorite route to work

Map	Description	Dist. (mi)	Speed (mph)	Elapsed Time (min)
A	Surface streets from home to SR520	3	30	6
B	SR520 to I-405	17	60	17
C	Surface streets from I-405 to office	6	30	12
	TOTAL	26	44	35
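The speed in the TOTAL row is total distance divided by total time, not the average of the segment speeds. A minimal sketch using the rounded values from the table above (the variable names are mine):

```python
# (distance in miles, elapsed time in minutes) for segments A, B, C of the Waze route
segments = [(3, 6), (17, 17), (6, 12)]

total_miles = sum(miles for miles, _ in segments)
total_minutes = sum(minutes for _, minutes in segments)

# Overall speed is total distance over total time, which weights each
# segment's speed by the time spent on it.
overall_mph = total_miles * 60 / total_minutes

# A plain mean of the segment speeds ignores how long each segment takes.
segment_speeds = [miles * 60 / minutes for miles, minutes in segments]
naive_mean_mph = sum(segment_speeds) / len(segment_speeds)

print(f"total: {total_miles} miles in {total_minutes} minutes")
print(f"overall speed: {overall_mph:.1f} mph")
print(f"naive mean of segment speeds: {naive_mean_mph:.1f} mph")
```

The overall speed comes out close to the figure in the TOTAL row, while the plain mean of the three segment speeds understates it because the fast highway segment takes up about half of the driving time.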

The best part is that I can see I-5 from the I-405 off-ramp. When traffic is light (speed is 30 mph or more), I can veer to the left and take I-5 to the office. When I-5 is congested, I can veer to the right and take surface streets. While on the surface streets, I can continue to see I-5 and confirm whether I made the right decision and improve my choice for future days.

Waze leads me astraze

After my success with Waze in the morning, I decide to use it for my evening commute home as well. As I turn onto I-5, Waze tells me there is roadkill ahead. I wonder where. Then suddenly I see a raccoon and am jolted by the thump. It saddens me to know that I’ve squashed an innocent animal under my tires, even if it is already dead.

The rest of the commute home is uneventful until Waze tells me to exit I-405 at NE 85th St in Kirkland (Fig 4b), 4 miles before my usual exit at SR520 (Fig 4a). Gee, that seems like a bad idea. Should I ignore Waze and keep going straight? Or should I take the exit? Maybe there is an accident on my regular route. Or maybe the crowd of Waze users knows a sneak route. Well, Waze has been pretty accurate so far, so I take the exit.

Ugh, what a mistake. Driving east on NE 85th St takes me straight into a huge traffic jam on Redmond Way. Also, there is a giant construction project on the Microsoft campus, so West Lake Sammamish Pkwy is overflowing with drivers avoiding lane closures on 156th Av NE. My commute today is more than 35 minutes longer than usual. I won’t do that again.

Figures 4a, b. My normal commute home, Waze suggestion for 11/21/2019

Conclusion

Both Waze and Google Maps show you unexpected options and are likely to give better routes than you could find on your own. Overall, my experience with Waze was better than Google Maps, but both could use improvements.

* * * *

All this talk about commute time has me remembering a brain teaser from my childhood. Let’s say I want my average commute speed to be 40 mph. One day, I get stuck in traffic and cover the first half of the distance to work at an average speed of 20 mph. How fast do I have to drive on the second half to meet my goal? Hint: The answer is not 60 mph or even 80 mph.
