This is a constantly evolving product and is meant to be "unfinished": we will keep making improvements as we think of them and have time to implement them. Given the time-sensitive nature of the COVID-19 pandemic, we wanted to get our analysis "out there" as quickly as possible while maintaining the quality we would expect from any high-quality data science project.
The basic functionality of the map includes:
Note: some other projects calculate this value using the current trajectory of the disease in a location, but we have opted for a simpler metric that is more interpretable. Our approach is a retrospective look at how a county is doing now relative to the recent past.
As per Thompson et al. (2019): "R(t) represents the expected number of secondary cases arising from a primary case infected at time t. This value changes throughout an outbreak. If the value of R(t) is and remains below one, the outbreak will die out. However, while R(t) is larger than one, a sustained outbreak is likely. The aim of control interventions is typically to reduce the reproduction number below one."
Because R(t) is sensitive to reporting abnormalities that may occur day-to-day, we provide smoothed metrics for 3- and 7-day time scales.
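As a rough illustration of the smoothing idea, the sketch below applies a simple trailing mean over 3- and 7-day windows. This is an assumption made for illustration: the project's actual smoother may differ (for example, a centered or weighted window), and `trailingMean` and the sample values are hypothetical.

```javascript
// Trailing moving average over a daily series.
// Assumption: a plain trailing mean; the project's real smoothing may differ.
function trailingMean(values, windowSize) {
  return values.map((_, i) => {
    // Not enough history yet to fill a whole window.
    if (i + 1 < windowSize) return null;
    const recent = values.slice(i + 1 - windowSize, i + 1);
    // Propagate missing days instead of silently averaging around them.
    if (recent.some((v) => v === null || Number.isNaN(v))) return null;
    return recent.reduce((sum, v) => sum + v, 0) / windowSize;
  });
}

// Hypothetical daily R(t) estimates for one county.
const dailyRt = [1.4, 0.9, 1.3, 1.0, 1.1, 0.8, 1.2];
const rt3 = trailingMean(dailyRt, 3); // 3-day smoothed series
const rt7 = trailingMean(dailyRt, 7); // 7-day smoothed series
```

A longer window damps day-to-day reporting noise more strongly but reacts more slowly to real changes, which is one reason to offer both time scales.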
For more details on the R(t) metric used in this project, see Nick Clark's Git repository.
This probability is calculated by dividing the number of people who have COVID-19 (but are not hospitalized) by a county's total population. The number of non-hospitalized people with COVID-19 is based on the following assumptions:
Using these assumptions, we calculate the number of infectious people in the population by finding the number of people who tested positive within the infectious period, multiplying by the undetected case figure, and subtracting the number of people who are likely to have been hospitalized (based on the age demographics of a county).
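The calculation described above can be sketched as follows. Every name and number here is a hypothetical placeholder (`infectiousPeriodDays`, `undetectedMultiplier`, `hospitalizationRate`, and the sample county), not the project's actual values, and applying the hospitalization rate to detected cases only is also an assumption.

```javascript
// Sum the new positive tests reported within the infectious period.
function positivesInInfectiousPeriod(dailyNewCases, infectiousPeriodDays) {
  return dailyNewCases
    .slice(-infectiousPeriodDays)
    .reduce((sum, n) => sum + n, 0);
}

// Estimate the share of a county's population that is infectious
// and not hospitalized. All parameters are illustrative assumptions.
function estimateInfectiousProbability(county, assumptions) {
  const recentPositives = positivesInInfectiousPeriod(
    county.dailyNewCases,
    assumptions.infectiousPeriodDays
  );
  // Scale detected cases up to account for undetected infections.
  const estimatedInfected = recentPositives * assumptions.undetectedMultiplier;
  // Assumption: hospitalizations are estimated from detected cases only.
  const estimatedHospitalized = recentPositives * assumptions.hospitalizationRate;
  const circulating = Math.max(0, estimatedInfected - estimatedHospitalized);
  return circulating / county.population;
}

// Hypothetical county and assumption values.
const exampleCounty = { population: 100000, dailyNewCases: [10, 12, 8, 15, 5] };
const exampleAssumptions = {
  infectiousPeriodDays: 3,
  undetectedMultiplier: 5,
  hospitalizationRate: 0.1,
};
const probability = estimateInfectiousProbability(exampleCounty, exampleAssumptions);
```

In practice the hospitalization rate would vary with a county's age demographics rather than being a single constant.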
Note 1: Because there are so many assumptions baked into this metric, it is certainly not the exact probability that a person on the street is infected. If, however, these assumptions hold about equally well across counties, we can compare the numbers to get an idea of relative risk between locations.
Note 2: This probability should be treated as a geographically informed prior. Many other factors need to be accounted for when assigning a probability that an individual is infected. For example, if you go to a bar, the probability that an individual is infectious is probably much higher than the number we report. Bars are filled with people who are (likely regularly) engaging in behavior that may expose them to the virus. Conversely, a person in the waiting room at a doctor's office is probably much less likely to be infectious than our metric suggests. Doctors' offices require temperature checks and screening surveys that lower the probability of an infectious person sitting in the waiting room.
The data behind the map is produced by an R script that runs daily. We are still updating this script frequently, and it is not stable enough to release publicly at this stage. Eventually, we will add this munge file to the repository.
The case counts and deaths are sourced from USA Facts. This is the best county-level data set we have found, and it is consistently updated daily. The remaining fields are calculated using our team's internally developed methodologies.
This map is a fairly vanilla Leaflet implementation with JavaScript (and jQuery) used to implement additional functionality. I stayed away from Leaflet plugins for the most part because they are generally pretty rigid in their implementation and I am particular. The layers on the map are sourced from a GeoJSON file created from US Census county boundary shapefiles.
The timeline controls were manually created and added as Leaflet control layers. Adding custom control layers allows for arbitrary HTML elements to be included "on top" of the map. We use jQuery to listen for changes to the time slider and then update the map using the data for the selected date. The "play" button advances the time slider by one tick on a set time interval (starting either at the current location of the time slider or restarting at the beginning if the slider is in the last position).
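The "play" behavior can be sketched independently of jQuery and Leaflet. The function names below are hypothetical, and stopping once the slider reaches its last position is an assumption; in the page itself, `getIndex` and `setIndex` would read and write the slider element and trigger the map redraw.

```javascript
// Decide where playback begins: restart from the first tick if the
// slider is already at its last position, otherwise keep the current tick.
function playbackStart(currentIndex, lastIndex) {
  return currentIndex >= lastIndex ? 0 : currentIndex;
}

// Advance the slider by one tick every intervalMs milliseconds.
// Assumption: playback stops when the last tick is reached.
function startPlayback(getIndex, setIndex, lastIndex, intervalMs) {
  setIndex(playbackStart(getIndex(), lastIndex));
  const timer = setInterval(() => {
    const current = getIndex();
    if (current >= lastIndex) {
      clearInterval(timer); // reached the final date, stop ticking
      return;
    }
    setIndex(current + 1);
  }, intervalMs);
  return timer; // the caller can pass this to clearInterval to pause
}
```

Keeping the index logic separate from the DOM makes it easy to check the restart rule without a browser.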
Clicking on a county produces a time series chart using the Apache ECharts JavaScript library. While a user could find the same information by adjusting the time slider, seeing the data in one time series visualization supports questions like, "how bad was the COVID-19 pandemic in county X?"
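A minimal sketch of how such a chart could be configured is shown below. The helper name, the shape of `points`, and the sample values are hypothetical; only the option structure (a category x-axis, a value y-axis, and a line series) follows the ECharts option format.

```javascript
// Build an ECharts option object for one county and one metric.
// `points` is assumed to be an array of { date, value } records.
function buildTimeseriesOption(countyName, metricName, points) {
  return {
    title: { text: `${countyName}: ${metricName}` },
    tooltip: { trigger: 'axis' },
    xAxis: { type: 'category', data: points.map((p) => p.date) },
    yAxis: { type: 'value', name: metricName },
    series: [
      { type: 'line', name: metricName, data: points.map((p) => p.value) },
    ],
  };
}

// Hypothetical data for a single county.
const chartOption = buildTimeseriesOption('County X', 'New cases', [
  { date: '2020-07-01', value: 12 },
  { date: '2020-07-02', value: 18 },
  { date: '2020-07-03', value: 9 },
]);

// In the browser, with the ECharts script loaded and a container element:
// echarts.init(document.getElementById('chart')).setOption(chartOption);
```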
Colors are one of the most controversial topics when it comes to any visualization. We chose to go with a yellow-to-red scale, using grey for NA values. The intent is to get close to an intensity scale while maintaining readability for people with and without color blindness.
Color thresholds are tricky and are constantly being reevaluated. Where possible, we try to make the thresholds meaningful. For example, an R(t) below 1 represents a "good" situation in that case counts should continue to decrease, so we assign the first threshold at 1 for the R(t) scales.
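Threshold-based coloring can be sketched as a small lookup function. Only the first R(t) threshold at 1 comes from the text above; the remaining thresholds, the hex colors, and the choice to place a value exactly equal to a threshold in the higher bin are illustrative assumptions.

```javascript
// Grey for missing (NA) values.
const NA_COLOR = '#cccccc';

// Map a value to a color: palette[0] applies below the first threshold,
// palette[i] applies from thresholds[i - 1] upward.
// The palette must have one more entry than the thresholds array.
function colorForValue(value, thresholds, palette) {
  if (value === null || value === undefined || Number.isNaN(value)) {
    return NA_COLOR;
  }
  let bin = 0;
  while (bin < thresholds.length && value >= thresholds[bin]) {
    bin += 1;
  }
  return palette[bin];
}

// First threshold at 1 as described above; the others are hypothetical.
const rtThresholds = [1, 1.5, 2];
// Hypothetical yellow-to-red palette.
const rtPalette = ['#ffffb2', '#fecc5c', '#fd8d3c', '#e31a1c'];

// A Leaflet style callback could then return something like:
// { fillColor: colorForValue(feature.properties.rt, rtThresholds, rtPalette) }
// where `feature.properties.rt` is a hypothetical property name.
```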
Ian is a data scientist and is the lead developer on the BigMap project.
Nick is a statistician and creator of the R(t) statistic used on the BigMap project.
Please submit an issue on the project repository and we'll do our best to address it/get back to you.