In this blog post, we return to the Covid-19 data provided by Johns Hopkins University. This time we focus on trends and correlations between some of the values in this data.
The data contains some interesting fields which we did not cover in previous posts:
- Case fatality ratio: according to Wikipedia, this is "the proportion of deaths from a certain disease compared to the total number of people diagnosed with the disease for a particular period". This value is not a constant; it varies by region and over time.
- Incidence rate: number of cases per 100,000 persons
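To make the two definitions concrete, here is a minimal sketch of how such values could be computed. The function names are our own, the numbers are made up, and we assume the ratio is expressed as a percentage; the dataset's own calculation may differ in detail.

```python
def case_fatality_ratio(deaths: int, confirmed: int) -> float:
    """Deaths as a percentage of confirmed cases (0.0 when there are no cases)."""
    if confirmed == 0:
        return 0.0
    return 100.0 * deaths / confirmed


def incidence_rate(confirmed: int, population: int) -> float:
    """Confirmed cases per 100,000 persons."""
    return 100_000 * confirmed / population


# Made-up numbers, for illustration only.
print(case_fatality_ratio(deaths=50, confirmed=2_000))        # 2.5
print(incidence_rate(confirmed=2_000, population=1_000_000))  # 200.0
```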
The Data Structure
We used a Google Colab Jupyter notebook to process the data. The data consists of around 250 CSV files like this one, which are very easy to process using Pandas. These are the columns available in all of them:
'FIPS', 'Admin2', 'Province_State', 'Country_Region', 'Last_Update', 'Lat', 'Long_', 'Confirmed', 'Deaths', 'Recovered', 'Active', 'Combined_Key', 'Incidence_Rate', 'Case-Fatality_Ratio'
The field description can be found on this page.
The data stored in the CSV files looks like this:
To work with the data as a whole, we first need to concatenate all of the CSV files into a single, moderately big dataset.
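The concatenation step might look roughly like the sketch below. It is not the notebook's actual code: the two in-memory "files", the `load_reports` helper and the `Report_Date` column are assumptions made for illustration, and only a subset of the columns is shown.

```python
import io

import pandas as pd

# Two tiny stand-ins for daily report files (made-up values, subset of columns).
# In a notebook these would be paths to the real CSV files instead.
DAILY_REPORTS = {
    "2020-09-23": "Province_State,Country_Region,Confirmed,Deaths\n"
                  "England,United Kingdom,100,10\n"
                  "Scotland,United Kingdom,20,2\n",
    "2020-09-24": "Province_State,Country_Region,Confirmed,Deaths\n"
                  "England,United Kingdom,110,11\n"
                  "Scotland,United Kingdom,23,2\n",
}


def load_reports(reports: dict) -> pd.DataFrame:
    """Read each daily report and stack them into one DataFrame."""
    frames = []
    for report_date, csv_text in reports.items():
        frame = pd.read_csv(io.StringIO(csv_text))
        # Hypothetical helper column: remember which daily file a row came from.
        frame["Report_Date"] = pd.to_datetime(report_date)
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)


covid = load_reports(DAILY_REPORTS)
print(covid.shape)  # (4, 5)
```

Keeping the report date as a column means the combined dataset can later be filtered by region and sorted by time.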
Biggest Data Contributors
The USA seems to be the country that contributes the most data to this dataset:
But other countries are also relatively well represented:
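One simple way to measure each country's contribution is to count its rows in the combined dataset. The sketch below uses made-up rows and assumes that "contribution" means the number of rows per `Country_Region`; the notebook may have measured it differently.

```python
import pandas as pd

# Made-up rows; only the Country_Region column matters for this count.
rows = pd.DataFrame(
    {"Country_Region": ["US", "US", "US", "India", "India", "Germany"]}
)

# value_counts() returns the number of rows per country, largest first.
rows_per_country = rows["Country_Region"].value_counts()
print(rows_per_country)
```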
Trend and Correlation Plotting
We started by creating plots for each of the nations of the United Kingdom, using the following fields:
- Daily case count - this field was calculated from the data. It is not an original field.
- Incidence rate
- Case fatality ratio
- Daily death count - also not an original field.
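The two calculated fields can be derived from cumulative totals by taking the difference between consecutive days. This is a sketch with made-up totals, assuming that `Confirmed` and `Deaths` are cumulative; the output column names are hypothetical.

```python
import pandas as pd

# Made-up cumulative totals for one region, one row per daily report.
cumulative = pd.DataFrame(
    {
        "Confirmed": [100, 110, 125, 125, 150],
        "Deaths": [10, 11, 11, 12, 14],
    },
    index=pd.date_range("2020-09-20", periods=5, freq="D"),
)

# diff() subtracts the previous day's total; the first day has no
# predecessor, so it becomes NaN and is dropped here.
daily = cumulative.diff().dropna().astype(int)
daily.columns = ["Daily_Cases", "Daily_Deaths"]  # hypothetical column names

print(daily["Daily_Cases"].tolist())   # [10, 15, 0, 25]
print(daily["Daily_Deaths"].tolist())  # [1, 0, 1, 2]
```

If a cumulative total is ever revised downwards, the difference for that day becomes negative, so such values may need separate handling.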
For the daily case count and the case fatality ratio, we also add lines that show the trend and the result of a plain linear regression over a period of roughly four months. For the trend tests we used a Python library, pymannkendall 1.4.1; one of the lines in the plots represents the result of the trend test.
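To show what a trend test of this kind does, here is a simplified, pure-Python sketch of a Mann-Kendall test next to an ordinary least-squares fit. It is not the code of pymannkendall (which offers this kind of test through its `original_test` function); the helper names are our own, the series are made up, and tied values are not corrected for.

```python
import math


def mann_kendall(values, alpha=0.05):
    """Basic Mann-Kendall trend test (no correction for tied values)."""
    n = len(values)
    # S counts later-minus-earlier pairs that go up, minus those that go down.
    s = sum(
        (values[j] > values[i]) - (values[j] < values[i])
        for i in range(n - 1)
        for j in range(i + 1, n)
    )
    var_s = n * (n - 1) * (2 * n + 5) / 18
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    # Two-sided p-value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    if p >= alpha:
        trend = "no trend"
    else:
        trend = "increasing" if z > 0 else "decreasing"
    return trend, p


def linear_fit(values):
    """Ordinary least-squares line through (0, y0), (1, y1), ..."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up series: one rises clearly, the other is noisy with a slight upward drift.
rising = [3, 5, 4, 8, 9, 12, 11, 15, 18, 17]
noisy = [5, 7, 4, 8, 5.5, 6, 9, 4.5, 7.5, 6.5]

for name, series in (("rising", rising), ("noisy", noisy)):
    trend, p = mann_kendall(series)
    slope, _ = linear_fit(series)
    print(f"{name}: {trend} (p={p:.3f}), regression slope={slope:.2f}")
```

In this sketch the noisy series has a positive regression slope but is still reported as "no trend", because the test only reports a trend when the evidence is statistically significant at the chosen `alpha`.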
So the plot for the daily case count looks like this for England (United Kingdom):
Here you can see an increasing trend (even though the linear regression line at around 2000 cases per day is relatively flat).
Then continuing with England, we have the incidence rate in a simple plot:
And the case fatality ratio:
Here we see a decreasing trend, in contrast to the increasing trend in the daily case count.
The daily deaths are also trending downwards, although the slope is less steep in this case.
We then create a simple correlation matrix, as a table and as a plot, that displays the Pearson correlation between each pair of fields:
A value close to one expresses positive linear correlation, i.e. the data in both fields tends to move in the same direction at the same time. A value close to zero expresses no linear correlation, and a value close to -1 expresses negative linear correlation, i.e. the two fields tend to move in opposite directions.
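A correlation matrix like this can be produced with the `corr` method of a Pandas DataFrame. The sketch below uses made-up values chosen to show the two extremes; the column names are hypothetical.

```python
import pandas as pd

# Made-up values for illustration; the column names are hypothetical.
frame = pd.DataFrame(
    {
        "Daily_Cases": [10, 20, 30, 40, 50],
        "Incidence_Rate": [1.0, 2.0, 3.0, 4.0, 5.0],       # moves with cases
        "Case_Fatality_Ratio": [9.0, 7.0, 5.0, 3.0, 1.0],  # moves against cases
    }
)

# DataFrame.corr() computes the pairwise Pearson correlation by default.
matrix = frame.corr(method="pearson")
print(matrix.round(2))
```

Here the daily cases and the incidence rate correlate at 1, and the daily cases and the case fatality ratio at -1, because the made-up columns are exact linear functions of each other. Real data will land somewhere in between.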
So what do we see in this table? The daily cases and the incidence rate have a high positive correlation (0.76), but the daily cases and the case fatality ratio have a relatively strong negative correlation (almost -0.7) for the period from 1st June until today (24th September). The strongest correlation is actually the one between the case fatality ratio and the incidence rate (-0.98), which suggests that more people are getting infected relative to the population, while at the same time the number of deaths per Covid-19 case is going down.
We have depicted this matrix again using Seaborn:
We repeated this exercise for Scotland, Wales and Northern Ireland. Here are some of the plots for Scotland:
And the correlation matrix:
The picture for Scotland is very similar to England's, with one difference: the daily deaths display no trend.
Wales also has similar curves to England, but I got a surprise when the Mann-Kendall test reported no trend in the daily cases, even though the regression and trend slopes point up. A likely explanation is that the test only reports a trend when it is statistically significant, so an upward slope in noisy data can still come out as "no trend":
If you open the Colab notebook, you will find that Northern Ireland and Scotland have very similar patterns.
USA and Other Countries
We also created the same plots for some states of the United States of America.
The USA overall also seems to have a strong negative correlation between the case fatality ratio and the incidence rate, like the United Kingdom:
Here are the plots for Florida:
Overall I got the impression that the USA has more diversity in its trends than the United Kingdom, which is not surprising as it is a bigger country.
We also looked at other countries like Germany, Brazil and India; you can find those plots in the Google Colab Jupyter notebook.
Worldwide Trend Counting
At the end of this exercise, we printed out the plots for all countries (which looks a bit messy) and counted the trends, i.e. how many countries show which trend in daily cases and in case fatality ratio. This is what we came up with:
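The counting itself can be sketched with `collections.Counter`. The countries and trend labels below are placeholders made up for illustration, not results from the data, and the `count_trends` helper is our own.

```python
from collections import Counter

# Placeholder countries and labels, made up for illustration only.
trends_by_country = {
    "Country A": {"daily_cases": "increasing", "case_fatality_ratio": "decreasing"},
    "Country B": {"daily_cases": "increasing", "case_fatality_ratio": "no trend"},
    "Country C": {"daily_cases": "no trend", "case_fatality_ratio": "decreasing"},
    "Country D": {"daily_cases": "decreasing", "case_fatality_ratio": "decreasing"},
}


def count_trends(trends: dict, field: str) -> Counter:
    """Count how many countries show each trend label for one field."""
    return Counter(labels[field] for labels in trends.values())


for field in ("daily_cases", "case_fatality_ratio"):
    print(field, dict(count_trends(trends_by_country, field)))
```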
Python has really great libraries for statistics and plotting, and we feel we are just barely scratching the surface of what you can do creatively with these libraries.
Regarding the data observations, the most striking phenomenon we saw was in the United Kingdom data: the strong negative correlation between the case fatality ratio and the incidence rate. This is a temporary observation that might well change in the months to come, but as of now this correlation is consistent in this country. The USA overall also has this negative correlation, albeit to a weaker degree. This correlation is not universal, though; Australia has a positive correlation between these two values: