Further Statistical Examination of the Sanders/Clinton Exit Polling Paper

At work today I noticed some tweets about a paper claiming that the Democratic primary election was stolen from Bernie Sanders. This piqued my interest; being a fan of stats, I followed the links and found the paper by Axel Geijsel and Rodolfo Cortes Barragan, which compares various metrics and exit polls to argue that states without a "paper trail" were more likely to support Clinton.

I read the study and my first reaction was a raised eyebrow for two reasons.

The first is theoretical. Exit polling is an inexact but important process. If there are distortions in the sampling process, the poll can produce quite different results from the final tally. There have been some notable examples from the UK, such as the infamous "Shy Tory" problem, where Conservative support in UK elections was consistently under-estimated by the polls, well outside the margin of error. The reason behind these errors is that the margin of error is not gospel: it assumes the sample is representative of the voting population. In the UK, the tendency of Labour supporters to harangue their Tory counterparts meant that Conservatives were "shy" and more likely to lie to the pollsters. As a result, the assumption underpinning the margin of error was violated. There have been other documented cases in elections around the world. This isn't to say exit polls are always inaccurate or useless, far from it, but they are imperfect tools, and I am sure there will be a series of postmortems to discuss why there have been errors outside the confidence interval this time around.
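To make that concrete, here's a toy simulation with completely made-up numbers (not drawn from any real poll): even a modest amount of "shy" misreporting can push a poll's estimate well outside its nominal margin of error.

set.seed(42) #Make the toy example reproducible
true.support <- 0.52 #Hypothetical true share for candidate A
n <- 1500 #Hypothetical exit poll sample size
1.96 * sqrt(0.5 * 0.5 / n) #Nominal 95% margin of error, roughly +/- 2.5 points
mean(rbinom(n, 1, true.support)) #Representative sample: estimate lands near 0.52
mean(rbinom(n, 1, true.support * 0.85)) #If 15% of A's supporters misreport, the estimate runs roughly 8 points low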

That being said, the sampling argument is largely theoretical. The paper by Geijsel and Barragan delves into the numbers. The central variable for the authors is the distinction between states with and without a paper trail, based on Ballotpedia. My political science chops are almost six years out of date and I have no reason to question the distinction. But I was curious about whether there may be intervening variables that could influence the study.

To their credit the authors published their data, so I grabbed a CSV version of their data set that showed support for Clinton versus not and slammed it into R.

ep<-read.csv('exit-polls.csv') #Read the data
head(ep) #Take a look see

##         State Support.for.Clinton.in.Exit.Polls
## 1     Alabama                             73.16
## 2     Arizona                             37.00
## 3    Arkansas                             66.02
## 4 Connecticut                             51.64
## 5     Florida                             63.96
## 6     Georgia                             65.72
##   Support.for.Clinton.in.Results    Paper.Trail
## 1                          77.84    Paper Trail
## 2                          57.63    Paper Trail
## 3                          66.28 No Paper Trail
## 4                          51.80    Paper Trail
## 5                          64.44 No Paper Trail
## 6                          71.33 No Paper Trail

Looking good. There are several models within the paper; the comparison between results and exit polls I'm not going to explore in much depth in this post because a) it's late and b) the exit poll sampling question remains relatively open. I don't disagree with the general premise of the results, that Clinton tended to outperform her exit polls. I'm more curious as to why.

The question is whether these instances of outperformance are systematically related to the presence or absence of a paper trail. Let's start by looking at the gap between the results and the exit polls by differencing the two. From there, a simple two-sample t-test will tell us whether there is a statistically significant average discrepancy between states with and without a paper trail.

ep$diff <- ep$Support.for.Clinton.in.Results - ep$Support.for.Clinton.in.Exit.Polls #Difference the two polling numbers
head(ep)

##         State Support.for.Clinton.in.Exit.Polls
## 1     Alabama                             73.16
## 2     Arizona                             37.00
## 3    Arkansas                             66.02
## 4 Connecticut                             51.64
## 5     Florida                             63.96
## 6     Georgia                             65.72
##   Support.for.Clinton.in.Results    Paper.Trail  diff
## 1                          77.84    Paper Trail  4.68
## 2                          57.63    Paper Trail 20.63
## 3                          66.28 No Paper Trail  0.26
## 4                          51.80    Paper Trail  0.16
## 5                          64.44 No Paper Trail  0.48
## 6                          71.33 No Paper Trail  5.61

t.ep <- t.test(diff~Paper.Trail, data = ep) #Compare the means
t.ep #Overlap

## 
##  Welch Two Sample t-test
## 
## data:  diff by Paper.Trail
## t = -0.40876, df = 16.726, p-value = 0.6879
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.062003  2.744860
## sample estimates:
## mean in group No Paper Trail    mean in group Paper Trail 
##                     2.750000                     3.408571

And to visualize the differences:

library(ggplot2) #For pretty graphs! <3 u Hadley
ggplot(ep, aes(x = diff, fill = Paper.Trail)) + geom_histogram(color = 'black') + xlab('Difference') + ylab('Count') #Histogram

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

(Figure: histogram of the differences between results and exit polls, filled by paper trail status.)
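As an aside, the `stat_bin()` message is just ggplot2 nudging me to pick a bin width; setting one explicitly (one percentage point is my arbitrary choice) silences the warning without changing the picture.

ggplot(ep, aes(x = diff, fill = Paper.Trail)) + geom_histogram(color = 'black', binwidth = 1) + xlab('Difference') + ylab('Count') #Same histogram with an explicit one-point bin width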

ggplot(ep, aes(y = diff, x = Paper.Trail, fill = Paper.Trail)) + geom_boxplot() #Boxplot

(Figure: boxplot of the differences by paper trail status.)

Woah, Arizona was wayyyyy off. While this state is listed as having a paper trail, the election there was quite a mess. Most commentators have associated this mess with the Republican state government, and both Clinton and Sanders sued over the results. Still, with this first batch there is no statistically significant difference in the gap between results and exit polls across the two classes of states (based on this data). Let's strike that case and re-run the analysis to see if the results change.

ep.no.az <- subset(ep, State != 'Arizona') #Leaving Arizona out
t.ep.2 <- t.test(diff ~ Paper.Trail, data = ep.no.az) #Redo
t.ep.2 #Nada

## 
##  Welch Two Sample t-test
## 
## data:  diff by Paper.Trail
## t = 0.69322, df = 20.746, p-value = 0.4959
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.333749  2.666057
## sample estimates:
## mean in group No Paper Trail    mean in group Paper Trail 
##                     2.750000                     2.083846

ggplot(ep.no.az, aes(x = diff, fill = Paper.Trail)) + geom_histogram(color = 'black') + xlab('Difference') + ylab('Count')

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

(Figure: histogram of the differences with Arizona removed.)

ggplot(ep.no.az, aes(y = diff, x = Paper.Trail, fill = Paper.Trail)) + geom_boxplot()

(Figure: boxplot of the differences with Arizona removed.)

Still no significant result.

In their appendix the authors also present a regression model that controls for the proportion of Latino/Hispanic individuals in a state as well as the relative "blueness" of the state. The authors didn't present raw data for this particular model, so I can't replicate the blueness factor without scraping a bunch of data, and as I said, it is late and I have to work tomorrow. However, I did find a population breakdown from the Kaiser Foundation, a well-respected health policy institute.

I was a little confused as to why the authors only controlled for the Hispanic population of a state. A significant trend in the election was Sanders' support among White voters, while Clinton tended to win the African American vote.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

race.data <- read.csv('raw_data.csv', stringsAsFactors = FALSE) #Read the data
race.data[race.data == 'N/A'] <- NA #Turn missing data into a format R likes
race.data$Asian <- as.numeric(race.data$Asian) #Clean up
race.data$Two.Or.More.Races <- as.numeric(race.data$Two.Or.More.Races)
head(race.data)

##        Location White Black Hispanic Asian American.Indian.Alaska.Native
## 1 United States  0.62  0.12     0.18  0.06                          0.01
## 2       Alabama  0.66  0.27     0.04  0.02                          <NA>
## 3        Alaska  0.57  0.02     0.09  0.10                          0.16
## 4       Arizona  0.49  0.04     0.39  0.04                          0.03
## 5      Arkansas  0.72  0.16     0.07    NA                          0.01
## 6    California  0.39  0.05     0.38  0.15                          0.01
##   Two.Or.More.Races Total
## 1              0.02     1
## 2              0.01     1
## 3              0.07     1
## 4              0.01     1
## 5              0.02     1
## 6              0.02     1

race.data[,c('White','Black','Hispanic','Asian', 'Two.Or.More.Races')] <- race.data[,c('White','Black','Hispanic','Asian', 'Two.Or.More.Races')]*100 #Rescale so that the regression coefs are expressed as per one percentage point change
head(race.data)

##        Location White Black Hispanic Asian American.Indian.Alaska.Native
## 1 United States    62    12       18     6                          0.01
## 2       Alabama    66    27        4     2                          <NA>
## 3        Alaska    57     2        9    10                          0.16
## 4       Arizona    49     4       39     4                          0.03
## 5      Arkansas    72    16        7    NA                          0.01
## 6    California    39     5       38    15                          0.01
##   Two.Or.More.Races Total
## 1                 2     1
## 2                 1     1
## 3                 7     1
## 4                 1     1
## 5                 2     1
## 6                 2     1

combo.data <- left_join(ep, race.data, by=c('State'='Location')) #join

## Warning in left_join_impl(x, y, by$x, by$y): joining character vector and
## factor, coercing into character vector

paper.only.mod <- lm(Support.for.Clinton.in.Results ~ Paper.Trail + Hispanic, data = combo.data) #The original model, minus the blueness control
summary(paper.only.mod) #Paper trail checks in 

## 
## Call:
## lm(formula = Support.for.Clinton.in.Results ~ Paper.Trail + Hispanic, 
##     data = combo.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.599  -2.628  -0.073   5.428  27.208 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             63.0879     5.6660  11.135 1.67e-09 ***
## Paper.TrailPaper Trail -13.0071     5.8651  -2.218   0.0397 *  
## Hispanic                 0.1379     0.2824   0.488   0.6313    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.26 on 18 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.2295, Adjusted R-squared:  0.1439 
## F-statistic:  2.68 on 2 and 18 DF,  p-value: 0.09573

That is a version of the original model, although admittedly lacking the control for blueness. It shows a significant negative effect for the paper trail variable, similar to that in the appendix of the Geijsel and Barragan paper. However, when we add the other major racial categories into the mix, the results shift.

fin.mod<-lm(Support.for.Clinton.in.Results ~  Paper.Trail + White + Black + Hispanic + Asian, data = combo.data, na.action = na.exclude) #refit
summary(fin.mod) #nada

## 
## Call:
## lm(formula = Support.for.Clinton.in.Results ~ Paper.Trail + White + 
##     Black + Hispanic + Asian, data = combo.data, na.action = na.exclude)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.436  -2.942   1.169   5.229   9.413 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)
## (Intercept)            80.79725   77.79192   1.039    0.318
## Paper.TrailPaper Trail -3.17326    4.12438  -0.769    0.455
## White                  -0.48548    0.79919  -0.607    0.554
## Black                   0.90971    0.80976   1.123    0.282
## Hispanic                0.04154    0.83613   0.050    0.961
## Asian                  -0.94225    1.10919  -0.849    0.411
## 
## Residual standard error: 7.888 on 13 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.7563, Adjusted R-squared:  0.6626 
## F-statistic: 8.069 on 5 and 13 DF,  p-value: 0.001174

combo.data$pred <- predict(fin.mod) 
ggplot(combo.data, aes(y = Support.for.Clinton.in.Results, x = Black, shape = Paper.Trail)) + geom_point(color = 'red') + geom_point(aes(y = pred), color = 'blue') + ggtitle("Predicted vs Actual Results, Red = Actual, Blue = Predicted") + xlab("% of Black Voters in State")

## Warning: Removed 3 rows containing missing values (geom_point).

## Warning: Removed 5 rows containing missing values (geom_point).

(Figure: predicted vs. actual results plotted against the percent of Black voters in the state; red = actual, blue = predicted.)

Two things to note. First, the effect of there being a paper trail becomes statistically insignificant. Second, while nothing else is significant, the coefficients pass the smell test based on what we know about the election. Additionally, the adjusted R-squared, a crude metric for the fit of the model, is much higher than in the version that did not include the share of African Americans.
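For a quick side-by-side of those fit statistics, the adjusted R-squared values can be pulled straight from the two model objects (these are the same numbers reported in the summaries above):

summary(paper.only.mod)$adj.r.squared #Paper trail + Hispanic only, about 0.14
summary(fin.mod)$adj.r.squared #With the fuller set of racial categories, about 0.66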

The lack of significance is not particularly surprising given the small sample size here. Even so, several observations were dropped due to incomplete demographic data. Let's re-run the model dropping the Asian and mixed-race shares and keeping the White, Black, and Hispanic populations, which are complete in the Kaiser Foundation data. Additionally, there is probably a multicollinearity issue because I dumped so many correlated metrics into the regression (the more white people there are in a state, the fewer minorities; multicollinearity can mess with OLS regression).
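One quick way to eyeball that correlation structure is a correlation matrix of the demographic shares (output not shown here); since the shares roughly sum to one, I'd expect the White share to move opposite the Black and Hispanic shares.

cor(combo.data[, c('White', 'Black', 'Hispanic')], use = 'complete.obs') #Pairwise correlations among the demographic shares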

black.hispanic.mod<-lm(Support.for.Clinton.in.Results ~  Paper.Trail + White + Black + Hispanic, data = combo.data) #Another model
summary(black.hispanic.mod) #let's see

## 
## Call:
## lm(formula = Support.for.Clinton.in.Results ~ Paper.Trail + White + 
##     Black + Hispanic, data = combo.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.672  -3.143   1.189   3.765  10.780 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)  
## (Intercept)             44.9369    58.4083   0.769   0.4529  
## Paper.TrailPaper Trail  -4.5184     3.7705  -1.198   0.2482  
## White                   -0.1028     0.6052  -0.170   0.8672  
## Black                    1.1859     0.6441   1.841   0.0842 .
## Hispanic                 0.3523     0.6816   0.517   0.6123  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.568 on 16 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.7769, Adjusted R-squared:  0.7211 
## F-statistic: 13.93 on 4 and 16 DF,  p-value: 4.433e-05

library(car)
vif(black.hispanic.mod)

## Paper.Trail       White       Black    Hispanic 
##    1.276514   20.982051   12.289759   17.998073

The high VIFs (variance inflation factors) mean we have a real issue with multicollinearity, which could suppress some effects. Let's drop the metric for white voters, as it has the highest variance inflation factor, and simply consider the shares of Black and Hispanic voters.

black.hispanic.mod2<-lm(Support.for.Clinton.in.Results ~  Paper.Trail +  Black + Hispanic, data = combo.data) #Another model
summary(black.hispanic.mod2) #let's see

## 
## Call:
## lm(formula = Support.for.Clinton.in.Results ~ Paper.Trail + Black + 
##     Hispanic, data = combo.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.853  -3.279   1.014   4.306  10.446 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             35.0592     5.3615   6.539 5.06e-06 ***
## Paper.TrailPaper Trail  -4.3407     3.5174  -1.234   0.2340    
## Black                    1.2895     0.1999   6.450 5.99e-06 ***
## Hispanic                 0.4645     0.1645   2.824   0.0117 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.349 on 17 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.7765, Adjusted R-squared:  0.737 
## F-statistic: 19.68 on 3 and 17 DF,  p-value: 9.047e-06

library(car)
vif(black.hispanic.mod2)

## Paper.Trail       Black    Hispanic 
##    1.178202    1.256066    1.111632

The findings generally hold up with the complete demographic data set. Namely, once you control for demographics, the effect of a state having or not having a paper trail becomes statistically insignificant, running contrary to the reported results in the earlier paper. This fits with what the polling has been saying: states with higher minority populations are more likely to support Clinton, and the paper trail variable is statistically insignificant by comparison. While this does not prove that nothing shady took place, it does mean that the strong conclusions of the original paper may need to be tempered or re-evaluated.

To sum up, the picture is complicated. The Sanders campaign is an energetic and interesting political force, one that should and will be studied by researchers and policy makers moving forward. However, based on the evidence presented in the Geijsel and Barragan paper, I am not sure I agree with their strong claims. Academic peer review is important, and if I were reviewing this paper I'd want to see further modelling and investigation into the data sources. I'm not claiming that the models I present here are perfect, by any means. As I stated earlier, it is late and I'm drinking a beer while writing this as my dog sleeps in my lap. What I am claiming is that the data needs to be unpacked, the issues surrounding sampling need to be explored, and further features added to the models before I am personally convinced that this election was stolen. If you feel differently, then more power to you, and I'd be interested in iterating on these models going forward. If you'd like to take a crack at it, the code and data files are on my GitHub; otherwise, I'm going to bed.