Exploring Ontario Covid Case Data… part 1

Since I live in Ontario, the status of cases is rather important for me. The provincial government has made ‘some’ data available on their website. To be honest, there is a fair bit of information that isn’t available for whatever reasons (their excuses fall short once you learn that other groups do release such data).

Note: I am not an epidemiologist. Any results and conclusions are made for my own personal amusement and education (and perhaps to show a possible employer).

So for part 1, the Status of COVID-19 cases in Ontario data set is explored somewhat. First, the data was examined and cleaned a bit in a Jupyter notebook. A few things were noticed:

The dataset is polluted with a ton of NAs. They will be converted to zeros.
There is a large range of data: the ‘Total patients approved for testing as of Reporting Date’ field is orders of magnitude larger than everything else. Displaying the results will need a little cleverness (that is, an automatically scaling y-axis).
A few days are missing. This only became noticeable to me because of a programming error (dates would not get looked up correctly since they didn’t exist). This gives me a little less confidence in the data.
There are 3 or 4 different groups of datasets within the overall dataset… more details below:

Subset 1: very incomplete data. The confirmed negative, presumed positive, and presumed negative datasets are incomplete. Data has stopped being measured for these fields. Below are the plots showing why they have been dropped for any further analysis.

Subset 2: Hospitalized cases. The columns Number of patients hospitalized with COVID-19, Number of patients in ICU with COVID-19, and Number of patients in ICU on a ventilator with COVID-19 have only been reported since April 2, 2020. This is approximately half the time of most of the other measurements. Hence these will be graphed together.

Subset 3: LTC cases. The columns Total Positive LTC Resident Cases, Total Positive LTC HCW Cases, Total LTC Resident Deaths, and Total LTC HCW Deaths have only been reported since May 19, 2020. These will also be graphed together.

Subset 4: Remaining data. This is the core data including: Confirmed Positive, Resolved, Deaths, Total Cases, Total patients approved for testing as of Reporting Date, Total tests completed in the last day, and Under Investigation. Therefore, these will also be graphed together.
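
As a rough illustration of the cleaning steps noted above (NAs converted to zeros, missing days detected), here is a minimal sketch using only the Python standard library. It is not the notebook code itself: the `Reported Date` column name, the sample values, and the helper names `load_rows` and `missing_days` are assumptions made for the example.

```python
import csv
import io
from datetime import date, timedelta

# Hypothetical three-row extract. The values are made up for illustration,
# and the "Reported Date" column name is an assumption.
SAMPLE_CSV = """Reported Date,Confirmed Positive,Deaths
2020-04-01,10,
2020-04-02,12,1
2020-04-04,15,2
"""

def load_rows(text):
    """Parse the CSV and convert empty (NA) numeric cells to zero."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        row = {"Reported Date": date.fromisoformat(raw["Reported Date"])}
        for key, value in raw.items():
            if key != "Reported Date":
                row[key] = int(value) if value.strip() else 0
        rows.append(row)
    return rows

def missing_days(rows):
    """Return every calendar day between the first and last row with no record."""
    seen = {row["Reported Date"] for row in rows}
    first, last = min(seen), max(seen)
    span = (last - first).days
    return [first + timedelta(days=n) for n in range(span + 1)
            if first + timedelta(days=n) not in seen]

rows = load_rows(SAMPLE_CSV)
gaps = missing_days(rows)
print(gaps)
```

Checking for gaps up front like this avoids discovering them later through a failed date lookup.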

A few comments about some of the numbers before we graph anything:

The Ontario health care system still uses fax machines to transmit critical health information (many articles on this… [1], [2]). Due to Covid, the Ontario College of Pharmacists will allow email instead of faxing for prescriptions.
Some mistakes happen. Recently there was this issue with 700 cases.
There is the ongoing controversy of how Covid deaths are counted.
Also there can be some delays in how and when various statistics are reported.

Note: The data sets are manually updated since I would like to verify that no major changes were made to the data (such as adding/removing columns). Hence the state of the data on this page can be a few days old.

Overall Values

We start with all of the overall values. A small tool was made using JS + D3 to view and explore the data. A few things to note about this data viewer:

The buttons on the side control what is drawn. The top three control the sub-set or group of data that was mentioned above. The bottom buttons act as a legend and are also used to hide/show a particular dataset.
The graph automatically rescales to the maximum of the selected datasets.
The data is drawn in two particular ways based on the path length. For shorter paths a linear step technique is used, while for much longer paths an interpolated curve is used. When a very long path is drawn with a step technique at a smaller resolution, it almost approximates a curved line, so it might as well be drawn as a curve to make it a bit smoother.
The colours were chosen using the d3.schemeCategory10 colour scheme.
Instead of using tooltips, I chose to list values below the graph. There are two main reasons for this choice: 1) there can be a lot of information, and hence a huge tooltip would be required (which might hide the graphs and become more annoying than helpful). 2) If necessary, values can be copied and pasted this way.
Finally, I have to say that I’m new to JavaScript programming, so some of the choices made might not be the best (particularly with how items are hidden/shown).
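
To make the rescaling and the step/curve choice described above concrete, here is a small language-neutral sketch written in Python rather than in the viewer's JS + D3. The cut-off `STEP_MAX_POINTS`, the function names, and the sample series are all hypothetical; the viewer's actual threshold is not stated here.

```python
# Hypothetical cut-off; the original viewer's actual threshold is not stated.
STEP_MAX_POINTS = 60

def y_axis_max(series_by_name, visible):
    """Largest value among the currently visible series (0 if none are shown)."""
    values = [v for name in visible for v in series_by_name[name]]
    return max(values, default=0)

def line_style(points):
    """Pick 'step' for short paths and 'curve' for long ones."""
    return "step" if len(points) <= STEP_MAX_POINTS else "curve"

# Made-up sample series, only to show the difference in magnitude.
series = {
    "Deaths": [0, 1, 3, 7],
    "Total tests": [100, 2500, 9000, 20000],
}

print(y_axis_max(series, ["Deaths"]))
print(y_axis_max(series, ["Deaths", "Total tests"]))
print(line_style(series["Deaths"]))
```

Hiding the large testing series lets the axis shrink to fit the smaller ones, which is the point of rescaling to the selected datasets only.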

Note that for optimal viewing it is best to use a computer instead of a tablet or phone. Click here to open the viewer without any WordPress page decorations.

Initially viewing the data is a little confusing. Some of the values are cumulative totals while others are the current value of a particular group.

Deaths

Next, a table tool was made to see the death values and rates. This dataset does not specify any details about the cases (such as age groups, geography, gender, comorbidities). From it we only know the total deaths and those that occurred at a long term care (LTC) centre.

A quick note about our little data table. I haven’t seen too many tables with a range slider, so I thought perhaps it’s worth making a small experiment to see if it is a good way to interact with data. I might use this range-slider data filtering a bit more often in the future.

Overall, more than half of the deaths occur in LTCs, while the rest occur outside of long term care facilities (in the community or short term in the hospital). Again, we do not know the age groups, but we can assume that LTC residents are generally a bit older.

There are a lot of articles out there that report a variety of death/mortality/fatality rates. Again, I am not an epidemiologist so I am not going to dive into the differences between the crude mortality rate, the infection mortality rate, or the case fatality rate. From the SIR and SEIR models that were previously examined, I am going to separate Infected (or Confirmed Positive) and Resolved into two separate groups. Next, the resolved are actually two separate groups: deaths and recovered. So the death rate that we are computing is found by:

$$\mathsf{Death\ Rate} = \frac{\mathsf{Deaths}}{\mathsf{Resolved}} = \frac{\mathsf{Deaths}}{\mathsf{Recovered} + \mathsf{Deaths}}$$

Note that we are not considering the current number of infected. Members of that group will eventually be in the death or the recovered groups, and at that point they will be counted. Until then they are not counted, and we will not assume which group they belong to. From this, we look at the total number of deaths and the number that did not occur in a long term care facility.
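
The death rate defined above can be written as a short function. This is only a sketch: the function names and the numbers are made up for illustration, and it assumes that the dataset's Resolved column plays the role of Recovered in the formula.

```python
def death_rate(deaths, recovered):
    """Deaths divided by all closed cases (recovered + deaths)."""
    closed = recovered + deaths
    if closed == 0:
        return None  # no closed cases yet, so the rate is undefined
    return deaths / closed

def non_ltc_deaths(total_deaths, ltc_deaths):
    """Deaths that did not occur in a long term care facility."""
    return total_deaths - ltc_deaths

# Made-up numbers, not taken from the Ontario data.
example_rate = death_rate(deaths=50, recovered=950)
print(f"{example_rate:.1%}")
```

Currently infected cases never enter the calculation, matching the note above.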

This article might be a little older, but it suggests that “Globally, about 3.4% of reported COVID-19 cases have died”. Our non-LTC death rate is rather close to this value. I am not sure if it’s valid to exclude such a large group, or if the LTC group should be considered separately and have its own mortality rate.

Daily Changes

It is perhaps easier to see if things are getting better or worse by looking at the daily changes. For a few particular categories, the daily changes for the past three weeks have been tracked. A few things to note:

The same colours that were used for the first graph are also used here. This is why some of the colours repeat: the aim is to stay consistent and avoid confusion.
Tool tips are drawn when the mouse cursor is placed over a bar.
A value of zero results in a bar that is one pixel in height. This allows the user to select it (so a tool tip can be shown). Also, if nothing was shown, it would be quite confusing for the user.
Like the previous graph and table, this graph was made in JS + D3, which gives the ability to animate the graph during a transition.
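
The daily-change values and the one-pixel minimum bar can be sketched as follows. This is written in Python for illustration only, not taken from the JS + D3 code; the function names, the linear scaling, and the sample numbers are assumptions.

```python
def daily_changes(cumulative, days=21):
    """Day-over-day differences, keeping only the most recent `days` values."""
    diffs = [b - a for a, b in zip(cumulative, cumulative[1:])]
    return diffs[-days:]

def bar_height(value, max_abs, plot_height, min_px=1):
    """Scale |value| to pixels, never drawing less than `min_px`."""
    if max_abs == 0:
        return min_px
    return max(min_px, round(abs(value) / max_abs * plot_height))

# Made-up cumulative series with one day of no change.
changes = daily_changes([1, 3, 3, 10])
heights = [bar_height(c, max(abs(x) for x in changes), 100) for c in changes]
print(changes, heights)
```

The zero-change day still gets a one-pixel bar, so there is something to hover over.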

Click here to see the data on a separate page without any WordPress decorations.

New Cases

Lately, various news sources have focused a lot of attention on the number of ‘new cases’ that occur each day. Quite often the value is in the headline of the top story (such as here and here). While the number of new cases is important, it is a combination of several values, which can be examined a bit further with the tool above.

First the math… in the Ontario dataset, the total cases is computed as:

$$\mathsf{Total\ Cases} = \mathsf{Confirmed\ Positive} + \mathsf{Resolved} + \mathsf{Deaths}$$

Next, the number of new cases is computed by summing the changes in the three terms that make up the total, that is:

$$\mathsf{New\ Cases} = \Delta(\mathsf{Confirmed\ Positive}) + \Delta\mathsf{Resolved} + \Delta\mathsf{Deaths}$$

Looking at each of these terms separately we have:

$\Delta\mathsf{Resolved}$ : since the number of resolved cases should always be increasing, this value should be either zero or positive. The larger the value, the better.
$\Delta\mathsf{Deaths}$ : just like resolved, the number of deaths can only increase, so the value can only be zero or positive. Obviously, the smaller the value, the better.
$\Delta(\mathsf{Confirmed\ Positive})$ : the change in the number of infections. This value can be either positive (an increase in the number of infections), negative (a decrease in the number of infections), or zero (for no change). At a simplistic level, we can label positive values as bad and negative ones as good.

Therefore, the New Cases value is composed of terms that represent both good and bad information. If one considers the following:

May 26, 2020: New Cases 287
May 27, 2020: New Cases 292

One can believe that these values are approximately the same and things are fine. However looking into these values we have:

May 26, 2020: New Infections: 6, Cases Resolved: 260, Additional Deaths: 21
May 27, 2020: New Infections: -154, Cases Resolved: 414, Additional Deaths: 32

Looking at these values, it is pretty obvious that May 27th was a better day than the 26th. A lot of articles do report these values, but they are often buried in the body of the text. Hence, simply reporting New Cases in a headline or tweet can be misleading.
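
The arithmetic for the two days above can be checked with a few lines of Python. The `DailyChange` class and its field names are hypothetical; the numbers are the May 26 and May 27 values quoted above.

```python
from dataclasses import dataclass

@dataclass
class DailyChange:
    new_infections: int  # change in Confirmed Positive (may be negative)
    resolved: int        # change in Resolved
    deaths: int          # change in Deaths

    @property
    def new_cases(self):
        """New Cases is the sum of the three daily changes."""
        return self.new_infections + self.resolved + self.deaths

may_26 = DailyChange(new_infections=6, resolved=260, deaths=21)
may_27 = DailyChange(new_infections=-154, resolved=414, deaths=32)

print(may_26.new_cases, may_27.new_cases)
```

Both days give nearly the same New Cases total even though the number of active infections rose slightly on one day and fell by 154 on the other.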

Perhaps in the future, the number of new infections should be the headline number instead of the number of new cases.

Implementation in JS + D3

The implementations of all the graphs and the table used above are available at this GitHub repository. Since very few calculations and data manipulations need to be done, everything was performed in regular JavaScript (the Python code was not included since it doesn’t contribute anything too significant).
