Animated Covid-19 map, screenshot from Domo

by George Taniwaki

To make predictions about the future trajectory of the spread of Covid-19, you need to be able to make sense of the currently available data. There are several steps to getting good data.

Medical event data

First, you have to be able to collect data from multiple sources, clean it, and aggregate it using standard criteria. Each data record could include the following elements:

  1. Event (what was counted, e.g., tests administered, positive test results, negative results, hospital admissions, ICU status, ventilation status, discharges, recoveries, deaths, etc.)
  2. Location ID (where the event occurred, see below)
  3. Date of incidence (when the event occurred)
  4. Date of reporting (sometimes data is reported days or even months after the event and can be updated many times as errors are corrected or missing data is estimated)
  5. Value (a count)
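
As a rough illustration, the five elements above could be stored as one record per report. The sketch below is only one possible layout: the class name, field names, location ID, and sample values are all made up for the example. Because the same event can be re-reported as errors are corrected, the helper keeps only the most recently reported value for each event, location, and date of incidence.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class EventRecord:
    """One reported count. All field names here are illustrative."""
    event: str               # what was counted, e.g. "deaths"
    location_id: str         # key into the location table
    date_of_incidence: date  # when the event occurred
    date_of_reporting: date  # when this count was published
    value: int               # the count


def latest_values(records):
    """Keep the most recently reported value for each
    (event, location, date of incidence) combination."""
    latest = {}
    for rec in records:
        key = (rec.event, rec.location_id, rec.date_of_incidence)
        if key not in latest or rec.date_of_reporting > latest[key].date_of_reporting:
            latest[key] = rec
    return {key: rec.value for key, rec in latest.items()}


# Hypothetical sample data: the April 1 count is revised a week later.
records = [
    EventRecord("deaths", "LOC-A", date(2020, 4, 1), date(2020, 4, 2), 10),
    EventRecord("deaths", "LOC-A", date(2020, 4, 1), date(2020, 4, 9), 12),
    EventRecord("deaths", "LOC-A", date(2020, 4, 2), date(2020, 4, 3), 7),
]
current = latest_values(records)
```

Keeping both dates in the record also makes it possible to reconstruct what was known on any given day, which is useful when checking a forecast against the data that was available when it was made.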

The best repository of Covid-19 data is maintained by the New York Times (on GitHub) with an interactive viewer. Johns Hopkins University Coronavirus Resource Center also has a dataset. The best source for counts of tests in the U.S. is the Covid Tracking Project sponsored by the Atlantic.


One of several graphics available from the New York Times

Public policy change data

In addition to medical events, there are public policy events that can be tracked, such as government orders to close nonessential businesses, travel restrictions, and so forth. These records could include the following elements:

  1. Event (what type of public policy change was made)
  2. Location ID (where the change applies to, see below)
  3. Date of incidence (when the change was implemented)
  4. Date of reporting (when change was reported, usually before the change is implemented)

Unfortunately, I could not find a centralized source of information on government restrictions and the dates they became effective. A different source of information that can help indicate how much contact there is between people is the amount of movement by people who carry smartphones. Smartphones contain a GPS antenna and can report their position. The position can be used to indicate what type of activity the person is engaging in. Google Health has a community mobility report that is updated regularly. An example report is shown below and the data in .csv format is available for download.


Among those who own Android smartphones and participate in tracking, trips have declined. Screenshot from Google Health

Demographic and geographic data

To analyze the data, you will want to append demographic and geographic data about the locations. Unlike events, demographic and geographic data changes slowly, so it only needs to be collected once during the model building process. The following data elements could be useful to prepare a model or forecast:

  1. Location ID (from above)
  2. Name or description
  3. Location hierarchy (continent > country > region > state > county > city > zip code, etc.)
  4. Latitude and longitude of centroid
  5. Latitude and longitude of center of largest city
  6. Surface area (km²)
  7. Total population
  8. Age distribution
  9. Gender distribution
  10. Income distribution
  11. Race distribution
  12. Political party affiliation distribution
  13. Health insurance coverage distribution
  14. Comorbidity distribution (smoking, diabetes, etc.)
  15. Number of hospitals
  16. Number of hospital beds
  17. Number of ICU beds
  18. Number of ventilators
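
Appending this data amounts to a lookup on Location ID. The sketch below assumes the event counts and the location attributes are already loaded as plain Python dictionaries. The location IDs, names, populations, and counts are invented for the example, and only a few of the attributes listed above are shown.

```python
# Hypothetical location table keyed by Location ID.
locations = {
    "LOC-A": {"name": "Example County A", "population": 250_000, "icu_beds": 40},
    "LOC-B": {"name": "Example County B", "population": 1_200_000, "icu_beds": 300},
}

# Hypothetical aggregated event counts.
counts = [
    {"location_id": "LOC-A", "event": "deaths", "value": 25},
    {"location_id": "LOC-B", "event": "deaths", "value": 60},
    {"location_id": "LOC-X", "event": "deaths", "value": 3},  # no match in the table
]


def append_location_data(counts, locations):
    """Attach location attributes to each count and add a per-100,000 rate.
    Rows whose Location ID is missing from the table are skipped."""
    enriched = []
    for row in counts:
        loc = locations.get(row["location_id"])
        if loc is None:
            continue
        merged = {**row, **loc}
        merged["per_100k"] = row["value"] * 100_000 / loc["population"]
        enriched.append(merged)
    return enriched


enriched = append_location_data(counts, locations)
```

In this made-up example the smaller county has fewer deaths but twice the per-capita rate, which is the kind of comparison the raw counts hide.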

Some good sources for this type of data are US Census, United Nations Demographic Year Book, United Nations Development Programme’s (UNDP) Human Development Report and the World Bank’s World Development Report, Gapminder, and ESRI.

Visualize the data

Once the data is aggregated, there are many ways to visualize it. Maps are an obvious way to display location data. Line charts are an obvious way to display time series data. Domo, a developer of business intelligence software, has a very nice animation that displays time series data on a map (screenshot at top of blog).

Two caveats about their display. First, the number of cases is underreported because testing for infection was not widespread early in the pandemic, and is still too low today.

Second, outside the U.S. the data is reported by country, not by state or other smaller region. A single marker is used to represent the location of events. This is probably fine for Europe or Africa, where countries tend to be small. However, it is misleading for larger countries like Canada, Russia, China, Indonesia, Australia, and Brazil. Even data for a state like California is distorted because one would expect separate markers for the Bay Area and for the LA Basin instead of a single one in the middle of the state.

Johns Hopkins Center for Systems Science and Engineering has produced a nice dashboard hosted on ArcGIS (screenshot below). It does a better job of dividing large countries into smaller geographic partitions, but the colors are dark. A description of the project was published in Lancet Infect Dis (Feb 2020) and in a press release (Jan 2020). All of the data and the dashboard are available in a GitHub repository.


Another example of a Covid-19 map. Screenshot from ArcGIS

A note about line charts. You often see Covid-19 growth charts by country that display time (either calendar date, or days since the nth event occurred) on the horizontal axis and count on the vertical axis. Both are scaled linearly. I find these charts hard to interpret and compare. I think a better way to display growth data is to display data on the vertical axis using logarithm of counts per 100,000 population and on the horizontal axis using days since the n*(population/100,000)th event occurred. Even better would be to divide large countries into smaller regions so that all the charts covered regions with similar populations.
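
The transformation I have in mind can be sketched in a few lines. This is only an illustration under my own assumptions: the function name and the sample numbers are invented, the input is a list of cumulative daily counts for one region, and I use a base 10 logarithm. Day zero is the first day the cumulative count reaches n*(population/100,000), which is the same as the first day the rate reaches n per 100,000.

```python
import math


def aligned_log_series(cumulative_counts, population, n=1.0):
    """Return (days since threshold, log10 of count per 100,000) pairs.
    Days before the cumulative count reaches n * population / 100,000
    are dropped. Assumes n > 0 so that the logarithm is defined."""
    threshold = n * population / 100_000
    points = []
    start = None
    for day, count in enumerate(cumulative_counts):
        if start is None:
            if count < threshold:
                continue
            start = day
        per_100k = count * 100_000 / population
        points.append((day - start, math.log10(per_100k)))
    return points


# Hypothetical region of 1,000,000 people, so the threshold for n=1 is 10 events.
points = aligned_log_series([2, 5, 10, 20, 40, 100], population=1_000_000, n=1.0)
```

On this scale, steady exponential growth plots as a straight line, and regions of very different size start at the same point on both axes, so their slopes can be compared directly.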

Making Forecasts

There are many groups making forecasts of Covid-19 infection rates and death rates. The CDC has a summary of them along with its own ensemble forecast. It predicts under 100,000 deaths in the U.S. at the end of May. The Institute for Health Metrics and Evaluation (IHME) predicts about 72,000 total deaths at the end of May but with a range from 60,000 to 115,000. You can download the data from the Global Health Data Exchange.

In addition to forecasting deaths, the IHME forecasts hospital utilization. These forecasts are used by hospitals to schedule resources and plan for peak usage.


Individual forecasts of cumulative reported deaths in U.S. from Covid-19 (left) and CDC ensemble forecast (right). Image from CDC


Cumulative death forecast in U.S. Image from IHME.

One of the best forecasts I have seen was produced by the Economist. It synthesizes data from US Census, New York Times, Covid Tracking Project, IHME, Google Health, and Unacast. The choropleth map of the U.S. below shows risk factors for Covid-19 mortality at the county level. Green shows areas where the risk level is low (less than 1%) and red shows high (6% or above).


Dixie in the crosshairs. Image from Economist

* * * *

Update1: In just one day, the IHME forecast is obsolete. See my response at

Update2: Add link to New York Times dataset and interactive viewer