This post explores the correlation between pollsters, using data provided by one statistical website: fivethirtyeight.com. This website maintains polling trackers like the Donald Trump popularity tracker, which averages the approval and disapproval ratings of the president of the USA reported by multiple pollsters. It also turns out that fivethirtyeight.com has a Git repository from which you can download the data the site uses.

### The Challenge

So I decided to explore the correlations between the most active pollsters of the Donald Trump popularity tracker. This exploration was done using a Jupyter notebook on Google Colab. I had questions like:

- Are most pollsters synchronised in terms of their observations? If so, to what degree?
- Are there pollsters that have significantly different results?
- Which pollsters converge and which pollsters diverge the most?

### The Technologies

In this Jupyter notebook, we are using popular Python libraries for our analysis, chiefly Pandas, with Matplotlib and Seaborn for plotting.

### Accessing the Data

The fivethirtyeight.com data can be accessed via Git. The data for our experiment is located at this URL, which a README file in the repository points to:

https://projects.fivethirtyeight.com/trump-approval-data/approval_polllist.csv

In our notebook, we simply ingest the data using Pandas:
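A minimal sketch of such an ingestion step is shown below. The helper name `load_polls` and the column names `startdate` and `enddate` are assumptions for illustration, and the two sample rows are made up so that the sketch runs without network access.

```python
import io

import pandas as pd

# URL quoted in this post; reading it directly requires network access.
POLL_LIST_URL = "https://projects.fivethirtyeight.com/trump-approval-data/approval_polllist.csv"


def load_polls(source):
    """Read the poll list from a URL, a file path or a file-like object.

    The date column names are assumptions; adjust them to the real file.
    """
    return pd.read_csv(source, parse_dates=["startdate", "enddate"])


# Tiny made-up sample (not real poll results) standing in for the CSV file.
sample_csv = io.StringIO(
    "president,pollster,startdate,enddate,approve,disapprove\n"
    "Donald Trump,Pollster A,2020-03-01,2020-03-03,43.0,53.0\n"
    "Donald Trump,Pollster B,2020-03-02,2020-03-04,45.0,52.0\n"
)
polls = load_polls(sample_csv)
print(polls.shape)
```

In a notebook with network access, `load_polls(POLL_LIST_URL)` would read the published file instead of the sample.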

### The Poll List Data

The approval_polllist.csv file we used contains 14608 rows × 22 columns. The fields of the ingested dataset are:

The **president** column has only a single value: "Donald Trump". At the time of this blog post, the data covers this time range:

2017-Jan-20 to 2020-Sept-04

### Data Preparation

There were three problems we faced initially with the data:

- some pollsters have **a small number of polls**, so the data on them is too sparse and not suitable to get meaningful results in terms of the correlation analysis.
- the data is **not in a format that is suitable for correlation analysis**.
- even those pollsters that have lots of polls are not producing data daily, so there will always be some days in which they do not produce polls. I.e. the **time series will have gaps.**

So we needed to address these problems first before we could start with our pollster correlation analysis.

We started by limiting the amount of data to the last **six months**. We are mostly interested in the polling behaviour during the Corona crisis:

We made sure that the data is sorted by start date:
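The two preparation steps above (restricting the data to six months and sorting by start date) could look roughly like the sketch below. The helper `last_six_months`, the column name `startdate` and the choice to measure the six months back from the newest poll are assumptions; the sample rows are made up.

```python
import pandas as pd


def last_six_months(polls, date_col="startdate"):
    """Keep polls from the six months before the newest poll, sorted by date."""
    # Assumption: the window is anchored at the newest start date in the data.
    cutoff = polls[date_col].max() - pd.DateOffset(months=6)
    recent = polls[polls[date_col] >= cutoff]
    return recent.sort_values(date_col).reset_index(drop=True)


# Made-up rows for illustration only.
polls = pd.DataFrame({
    "pollster": ["A", "B", "A", "B"],
    "startdate": pd.to_datetime(
        ["2020-09-01", "2019-12-15", "2020-03-10", "2020-06-20"]
    ),
    "approve": [43.0, 44.0, 45.0, 41.0],
})
recent = last_six_months(polls)
print(recent)
```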

### Finding Most Frequent Pollsters

The most relevant pollsters are the ones that produced the most polls in the last six months. These are the most active ones:
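Counting polls per pollster is a one-liner in Pandas. The sketch below assumes a column named `pollster` and uses made-up rows; `most_active_pollsters` is a hypothetical helper name.

```python
import pandas as pd


def most_active_pollsters(polls, top_n=5):
    """Return the top_n pollsters by number of polls, most active first."""
    return polls["pollster"].value_counts().head(top_n)


# Made-up rows for illustration only.
polls = pd.DataFrame({"pollster": ["A", "B", "A", "C", "A", "B"]})
ranking = most_active_pollsters(polls, top_n=2)
print(ranking)
```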

Here is the ranking:

### Pivoting the Data

To perform the correlation analysis between pollsters, we need to have the start time and end time as keys and move each pollster into a column. This can easily be achieved in Pandas with this code:
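A minimal sketch of such a pivot, assuming columns named `startdate`, `enddate`, `pollster` and `approve` and using made-up rows, might look like this. It uses `pivot_table`, which averages the values if a pollster has several polls for the same dates; whether the notebook handled duplicates this way is an assumption.

```python
import pandas as pd

# Made-up rows for illustration only.
polls = pd.DataFrame({
    "startdate": pd.to_datetime(["2020-03-01", "2020-03-01", "2020-03-02"]),
    "enddate": pd.to_datetime(["2020-03-03", "2020-03-03", "2020-03-04"]),
    "pollster": ["A", "B", "A"],
    "approve": [43.0, 45.0, 44.0],
})

# One row per (startdate, enddate) pair and one column per pollster.
# Pollsters without a poll for a given pair of dates get NaN in that row.
pivoted = polls.pivot_table(
    index=["startdate", "enddate"], columns="pollster", values="approve"
)
print(pivoted)
```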

This is what the table looks like after pivoting:

So the number of rows shrank and the number of columns increased significantly.

From the screenshot above you can already see that there is a lot of missing data, but we will deal with that later.

### Looking at the Data of Specific Pollsters

Before going to our final correlation analysis, we wanted to visualise the approval and disapproval ratings on a pollster by pollster basis. This was mainly done to check if the data made sense. We created some code to extract and plot the approval and disapproval rating per pollster:

An important note: in the plot generation code, we have used **linear interpolation** to fill the gaps with missing data. This helped our plots to have continuous lines.
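To illustrate what linear interpolation does to a series with gaps, here is a minimal sketch on made-up values. It is not the plotting code from the notebook, only the gap-filling step in isolation.

```python
import numpy as np
import pandas as pd

# Made-up daily approval values with two missing days in the middle.
dates = pd.date_range("2020-03-01", periods=4, freq="D")
approve = pd.Series([42.0, np.nan, np.nan, 45.0], index=dates)

# method="linear" treats the rows as equally spaced and fills each gap
# on a straight line between the neighbouring known values.
filled = approve.interpolate(method="linear")
print(filled)
```

The two missing days end up at 43.0 and 44.0, which is why heavily interpolated pollsters produce straight, smooth line segments in the plots.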

With the plot generation code in place, we generated Donald Trump's approval, disapproval and net approval plots for the most active pollsters:

- YouGov

Note: the green line is just a marker for the division between positive and negative values.

- Morning Consult

- Rasmussen Reports/Pulse Opinion Research

- Ipsos

- Global Strategy Group/GBAO (Navigator Research)

So it seemed that our data is "plottable" and makes sense. There is one interesting effect here: the more interpolation was needed for a pollster, the smoother its lines became.

### Correlation

What is correlation? Wikipedia defines it like this:

*--- Quote ---*

*In statistics, correlation or dependence is any statistical relationship, whether causal or not, between two random variables or bivariate data. In the broadest sense correlation is any statistical association, though it commonly refers to the degree to which a pair of variables are linearly related*

*--- End Quote ---*

This means that correlation is a statistical relationship between two vectors of data. A vector of data is a list of values of the same kind, in our case the approval ratings of Donald Trump reported by one pollster. We always compare two such vectors, one for each of two separate pollsters.

**Correlation values**

The most important thing to know about correlation is what its values actually mean. These ranges apply to all the correlation methods discussed here, not just to one of them:

| Correlation value | Meaning |
| --- | --- |
| correlation near 1 | The values in the two vectors are closely correlated in a positive sense. So the curves in our plots should go up and down in the same way. |
| correlation near 0 | No correlation. The curves in the plots go up and down independently of each other. |
| correlation near -1 | The values in the two vectors are closely correlated in a negative sense. So the curves in our plots should go up and down in the opposite way. |

**Correlation methods**

There are different methods to calculate correlation, but we are going to focus on these three, which are directly supported by Pandas: Pearson, Spearman and Kendall.

**Pearson** is perhaps the most popular correlation method. It relies on this formula and is surprisingly easy to implement in Python with Pandas:

The Pearson correlation for a population is the covariance of vectors X and Y divided by the product of the standard deviations of X and Y. A naive Python implementation would be:
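One possible naive version, written directly from the definition above, is sketched here. The function name `pearson` and the sample values are made up for illustration; the result is compared with Pandas' built-in `Series.corr`.

```python
import pandas as pd


def pearson(x, y):
    """Population Pearson correlation: cov(X, Y) / (std(X) * std(Y))."""
    x_dev = x - x.mean()
    y_dev = y - y.mean()
    covariance = (x_dev * y_dev).mean()
    # ddof=0 selects the population standard deviation.
    return covariance / (x.std(ddof=0) * y.std(ddof=0))


# Made-up approval values for two hypothetical pollsters.
a = pd.Series([43.0, 44.0, 42.0, 45.0, 46.0])
b = pd.Series([44.0, 45.0, 44.0, 47.0, 46.0])
print(pearson(a, b), a.corr(b))
```

Both printed numbers should agree up to floating point error, because the normalisation factors cancel in the ratio.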

**Spearman** correlation is a variation of **Pearson** which uses the ranks of the values instead of the values themselves:

This one is also easy to implement in Python with some Pandas:
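A minimal sketch of the idea: rank both vectors, then apply the Pearson correlation to the ranks. The function name `spearman` and the sample values are made up for illustration.

```python
import pandas as pd


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks.

    Series.rank() assigns average ranks to tied values by default.
    """
    return x.rank().corr(y.rank())


# b grows monotonically with a, but not linearly.
a = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
b = a ** 3
print(spearman(a, b), a.corr(b))
```

Because the ranks of `a` and `b` are identical, the Spearman value is 1, while the Pearson value on the raw numbers stays below 1.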

**Kendall** correlation is based on the number of concordant and discordant pairs. Here is the formula:
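The idea of counting pairs can be sketched as a naive implementation. This is the simple tau-a variant, (concordant - discordant) / number of pairs, which applies no correction for ties; library implementations typically use a tie-corrected variant, so results can differ when the data contains ties. The function name `kendall_tau` and the sample values are made up.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        # Positive product: both vectors move in the same direction.
        direction = (x[i] - x[j]) * (y[i] - y[j])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


x = [1, 2, 3, 4]
y = [1, 3, 2, 4]
# 5 concordant pairs and 1 discordant pair out of 6.
print(kendall_tau(x, y))
```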

### Results of the Correlation Analysis

**Correlation Heat Maps**

Here we correlate the approval and disapproval ratings for Donald Trump per pollster. There are multiple ways to depict the correlations, and one of them is a correlation heat map. Here are the **Pearson** based correlation matrices we generated for the last six months of polls for approval ratings.

or a somewhat larger image:

This is a correlation matrix that displays the correlation of any pollster to any other pollster (including itself). Here are some examples to explain the content of the plot:

| Pollster 1 | Pollster 2 | Explanation |
| --- | --- | --- |
| YouGov | Morning Consult | Correlation between YouGov and Morning Consult: 0.62. This is a relatively high positive correlation. |
| YouGov | Rasmussen Reports | Correlation between YouGov and Rasmussen Reports / Pulse Opinion Research: -0.19. There is not much correlation here. The value is close to zero. |
| Ipsos | Global Strategy Group | Correlation between Ipsos and Global Strategy Group: 0.81. This is a high positive correlation, meaning that these two pollsters seem to be in positive agreement most of the time. |
| Ipsos | Ipsos | Complete positive agreement: 1.0. This just means that the values are the same. |

Here is another plot using **Spearman** correlation. It is very similar to the above plot:

And **Kendall** correlation:

The interesting thing about this correlation is that the values are closer to zero.
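The matrices behind such heat maps can be computed with `DataFrame.corr`. The sketch below uses made-up approval values for three hypothetical pollsters and is not the notebook's code; it only shows the shape of the result.

```python
import numpy as np
import pandas as pd

# Made-up approval series, one column per hypothetical pollster.
approvals = pd.DataFrame({
    "Pollster A": [43.0, 44.0, 42.0, 45.0, 46.0],
    "Pollster B": [44.0, 45.0, 44.0, 47.0, 46.0],
    "Pollster C": [46.0, 45.0, 47.0, 44.0, 43.0],  # mirrors Pollster A
})

# method can also be "spearman" or "kendall".
pearson_matrix = approvals.corr(method="pearson")
print(pearson_matrix.round(2))
```

The result is a square, symmetric table with 1.0 on the diagonal, which can then be handed to a heat map function such as Seaborn's `heatmap`.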

Another possible way to display correlations is to use **scatter plots**.

**Scatter Plots**

Here are some examples of scatter plots:

There is not much correlation here between Rasmussen and YouGov (see the heat maps above). With perfect correlation, the scatter plot looks like this:

If there is some level of positive correlation, then you see scatter plots like this one:

With Seaborn you can draw all scatter plot combinations in one go:

### Conclusion and Findings

Looking at the correlation plots, we can identify one pollster which correlates poorly with all the others. If you look closely at this plot, you will see a blueish cross (with a red intersection):

That cross belongs to Rasmussen Reports and means that its polls over the six months do not correlate with those of any other pollster. We have also tried to add up correlation coefficients to see which pollsters correlate the least and the most. This resulted in this plot:
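Adding up the coefficients per pollster can be sketched as follows. The matrix values are made up, and subtracting the self-correlation on the diagonal is an assumption about how such a sum would sensibly be built, not a description of the notebook's exact code.

```python
import pandas as pd

names = ["Pollster A", "Pollster B", "Pollster C"]
# Made-up correlation matrix for three hypothetical pollsters.
corr = pd.DataFrame(
    [[1.0, 0.6, -0.2],
     [0.6, 1.0, -0.1],
     [-0.2, -0.1, 1.0]],
    index=names,
    columns=names,
)

# Sum each row and subtract 1.0 to drop the pollster's self-correlation.
agreement = (corr.sum(axis=1) - 1.0).sort_values()
print(agreement)
```

The first entry of the sorted result is the pollster that agrees least with the others, the last entry the one that agrees most.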

So it seems that the pollster that correlates the most is Ipsos and the one that by far correlates the least is Rasmussen.

From the perspective of technology: Pandas is such an amazing software package. It helps you to filter, pivot, interpolate, calculate all correlations and draw all diagrams with the help of Matplotlib. It is easy to use and extremely powerful.