Chapter 4 Modelling

Have you ever wondered what to do when you want to predict more than 2 categories, and these categories seem to be ordered? Ordinal logistic regression is the way to go! In the case of pet adoption speeds, we have 5 categories, with group 0 being the fastest adoption speed, 5 being the slowest.

For ordered logistic regression, we assume that the relationship between each pair of outcome groups is the same. This is called the proportional odds assumption. In other words, the coefficient for variable A will be the same regardless of whether it’s describing the relationship between outcome group 1 and 2, or outcome group 2 and 3.

4.1 Set Up

First, let’s load some essential package.

We will also create a sample of 1000 for faster modelling.

load("adoption.RData")
set.seed(454)
mysample <- adoption[sample(1:nrow(adoption), 1000,
   replace=FALSE),]

Below is the function I wrote to compute the accuracy of each model.

ordinal_accuracy<-function(post_preds,mydata){
  post_preds<-as.data.frame(post_preds)
  results<-c()
  for (j in (1:length(post_preds))){
    results[j]<-as.numeric(tail(names(sort(table(post_preds[,j]))))[5])
    }
  results<-as.data.frame(results)
  compare<-cbind(results,mydata$AdoptionSpeed)
  compare<-compare %>%mutate(results=as.numeric(results))
  compare<-compare %>% mutate(`mydata$AdoptionSpeed`=as.numeric(`mydata$AdoptionSpeed`))
  compare<-compare %>%mutate(accuracy=ifelse(as.numeric(results)==as.numeric(`mydata$AdoptionSpeed`),1,0))
  print(sum(compare$accuracy)/length(post_preds))
}

4.2 Model 1

From the “Starting off” section, we see that health factors and type can be useful factors in predicting adoption speed.

4.2.1 Model Building

\[\text{For Bayesian Ordinal Regression, we introduce a latent variable }y^*,\text{ modeled as a linear function with our chosen predictors,} \\\text{ the vector }\zeta\text{ with 4 cutpoints, and let }Y_i\text{ be the ordinal outcome for the ith animal with 5 possible adoption speeds.}\] \[\\Y_i|\zeta_1,...,\zeta_4,\beta_1,...,\beta_6 = \left\{ \begin{array}{ll} 0 & \quad y^* < \zeta_1 \\ 1 & \quad \zeta_1 \leq y^* < \zeta_2 \\ 2 & \quad \zeta_2 \leq y^* < \zeta_3 \\ 3 & \quad \zeta_3 \leq y^* < \zeta_4\\ 4 & \quad y^*\ge \zeta_4 \end{array} \right.\] \[y^*=\beta_1x_1+\beta_2x_2+...+\beta_6x_6\]where \(x_n\) for n from 1 to 6 are the indicators for cat, minor injury, severe injury, sterilization, vaccination, and dewormed. And \(\beta_n\) are the coefficients for these indicators. \[\zeta_1 \sim N(m_{01},s_{01}^2) \\... \\\beta_1 \sim N(m_1,s_1^2) \\...\]

For this model, we will be setting a R2 prior, which is the proportion of variance in the outcome that is attributable to the coefficients in our linear model. Since we don’t have any prior information, we will set a standard uniform prior on R2.

model1 <- stan_polr(AdoptionSpeed ~ Type+Health+Sterilized+Vaccinated+Dewormed, data =mysample, prior=R2(0.5,what=NULL),iter=5000, seed = 454)

4.2.2 Posterior Inference

model1_summary<-summary(model1)
head(as.data.frame(model1_summary), -2)

##                     mean         mcse          sd         10%          50%
## Type1       -0.291472332 1.067107e-03 0.128917172 -0.45490997 -0.291619825
## Health1      0.441523473 3.135560e-03 0.350471059 -0.01079189  0.437815394
## Health2      2.289862443 1.401822e-02 1.356098798  0.68125324  2.195511682
## Sterilized1  0.690688772 1.535365e-03 0.164162835  0.48092258  0.688819102
## Vaccinated1  0.318715747 1.655190e-03 0.179963272  0.08924947  0.318538724
## Dewormed1   -0.001789522 1.616438e-03 0.168594313 -0.21720001 -0.001004759
## 0|1         -3.355148548 2.201978e-03 0.232536209 -3.65773490 -3.349498499
## 1|2         -1.058289904 1.284121e-03 0.139953255 -1.23837624 -1.057180267
## 2|3          0.276656461 1.187258e-03 0.133952418  0.10538132  0.276139337
## 3|4          1.272267361 1.207162e-03 0.140589158  1.09125504  1.271952833
## mean_PPD:0   0.031505172 8.265987e-05 0.008632395  0.02093596  0.030788177
## mean_PPD:1   0.203957020 1.955397e-04 0.019572049  0.17980296  0.203201970
## mean_PPD:2   0.288257266 2.204405e-04 0.022416216  0.25985222  0.288177340
## mean_PPD:3   0.214343966 1.973403e-04 0.020048851  0.18965517  0.214285714
##                     90% n_eff      Rhat
## Type1       -0.12640415 14595 0.9999596
## Health1      0.89662335 12493 0.9997813
## Health2      4.02976553  9358 0.9999248
## Sterilized1  0.90384820 11432 1.0000488
## Vaccinated1  0.54993335 11821 0.9998534
## Dewormed1    0.21453193 10878 0.9998297
## 0|1         -3.06134550 11152 1.0000541
## 1|2         -0.87929294 11878 0.9998446
## 2|3          0.44947033 12729 1.0000083
## 3|4          1.45397384 13564 1.0001160
## mean_PPD:0   0.04310345 10906 0.9999094
## mean_PPD:1   0.23029557 10018 0.9999413
## mean_PPD:2   0.31650246 10341 1.0001028
## mean_PPD:3   0.24014778 10322 1.0000672

set.seed(454)
model_data1<-mysample %>% 
  dplyr::select(AdoptionSpeed, Type,Health,Sterilized,Vaccinated,Dewormed) %>% 
  na.omit()
my_prediction1 <- posterior_predict(
  model1, 
  newdata = model_data1)

ordinal_accuracy(my_prediction1,model_data1)

## [1] 0.2438424

The formula using the posterior means of each variable is: \[y^*=-0.291*Type+0.442*Health1+2.290*Health2+0.691*Sterilized1 \\+0.319*Vaccinated1-0.002*Dewormed1 \\{\zeta_1,...\zeta_4}={0,-3.62,-1.42,-0.17,0.76}\text{ respectively}\] The accuracy for this model is 0.244. If an animal is a cat, then \(y^*\) will decrease by 0.291, making the animal more likely to get adopted faster than a dog, keeping all other variables constant. We can see that the mean of Dewormed is near zero, therefore, we will exclude Dewormed in our future model.

4.3 Model 2

With further investigations, we have also found “AgeGroup” and “mix_breed” to be important factors (visualizations from the “More digging” section). We will include these in our models.

4.3.1 Model Building

\[\text{Let }y^*\text{ be the latent variable, }\zeta\text{ be the vector with 4 cutpoints, and }Y_i\text{ be the } \\\text{ ordinal outcome for the ith animal with 5 possible adoption speeds.}\] \[\\Y_i|\zeta_1,...,\zeta_4,\beta_1,...,\beta_{10} = \left\{ \begin{array}{ll} 0 & \quad y^* < \zeta_1 \\ 1 & \quad \zeta_1 \leq y^* < \zeta_2 \\ 2 & \quad \zeta_2 \leq y^* < \zeta_3 \\ 3 & \quad \zeta_3 \leq y^* < \zeta_4\\ 4 & \quad y^*\ge \zeta_4 \end{array} \right.\] \[y^*=\beta_1x_1+\beta_2x_2+...+\beta_{11}x_{11}\] where \(x_n\) for n from 1 to 11 are the indicators for cat, minor/severe injury, sterilization, vaccination, age groups 1-5, and Mix breed. \[\zeta_1 \sim N(m_{01},s_{01}^2) \\... \\\beta_1 \sim N(m_1,s_1^2) \\...\]

model2 <- stan_polr(AdoptionSpeed ~ Type+Health+Sterilized+Vaccinated+AgeGroup+MixedBreed, data =mysample, prior=R2(0.5,what=NULL),iter=5000, seed = 454)

4.3.2 Posterior Inference

model2_summary<- summary(model2)
head(as.data.frame(model2_summary), -2)

##                    mean         mcse          sd         10%         50%
## Type1        0.48778366 1.707863e-03 0.221427458  0.19976369  0.48937164
## Health1      0.47969467 3.195916e-03 0.348317802  0.03818016  0.47965636
## Health2      1.67486939 1.220502e-02 1.279389983  0.13553622  1.58676636
## Sterilized1  0.47813952 1.457010e-03 0.181320803  0.24420605  0.47770227
## Vaccinated1  0.24826375 1.199002e-03 0.149064908  0.05697017  0.24762385
## AgeGroup1    0.54348134 1.459234e-03 0.170443602  0.32467603  0.54231742
## AgeGroup2    0.66716931 1.993412e-03 0.227959035  0.37680824  0.66989071
## AgeGroup3    0.96625893 3.180660e-03 0.345682844  0.51766283  0.95983629
## AgeGroup4    0.65239525 3.793133e-03 0.437359847  0.08932112  0.64813984
## AgeGroup5    0.53430605 5.266733e-03 0.592109782 -0.22336947  0.52774019
## MixedBreed1  1.06601710 2.044574e-03 0.228975968  0.77289049  1.06582076
## 0|1         -2.45032288 2.856396e-03 0.287372279 -2.81857404 -2.44775572
## 1|2         -0.13451482 2.159830e-03 0.230288981 -0.43149332 -0.13648189
## 2|3          1.23866701 2.011140e-03 0.232081787  0.93921284  1.23811248
## 3|4          2.26420168 2.033693e-03 0.241510274  1.95486028  2.26515793
## mean_PPD:0   0.03138345 7.841415e-05 0.008449069  0.02068127  0.03041363
## mean_PPD:1   0.20154197 1.840315e-04 0.019143089  0.17761557  0.20072993
## mean_PPD:2   0.28724696 2.128690e-04 0.022075926  0.25912409  0.28710462
## mean_PPD:3   0.21470985 1.891734e-04 0.019819265  0.18978102  0.21411192
##                     90% n_eff      Rhat
## Type1        0.76985933 16810 0.9998001
## Health1      0.92512579 11878 0.9998389
## Health2      3.31114337 10988 0.9999712
## Sterilized1  0.70659244 15487 0.9997555
## Vaccinated1  0.43924717 15457 0.9997827
## AgeGroup1    0.76354356 13643 1.0000624
## AgeGroup2    0.95776625 13077 0.9998310
## AgeGroup3    1.41472859 11812 0.9996562
## AgeGroup4    1.21477049 13295 0.9997109
## AgeGroup5    1.29127895 12639 0.9998504
## MixedBreed1  1.35920999 12542 0.9997835
## 0|1         -2.08201004 10122 0.9998176
## 1|2          0.16165719 11369 0.9997512
## 2|3          1.53309341 13317 0.9998544
## 3|4          2.57697695 14103 0.9998261
## mean_PPD:0   0.04257908 11610 0.9998802
## mean_PPD:1   0.22627737 10820 1.0000967
## mean_PPD:2   0.31630170 10755 0.9999731
## mean_PPD:3   0.23965937 10976 0.9998674

set.seed(454)
model_data2<-mysample %>% 
  dplyr::select(AdoptionSpeed, Type,Health,Sterilized,Vaccinated,AgeGroup,MixedBreed) %>% 
  na.omit()
my_prediction2 <- posterior_predict(
  model2, 
  newdata = model_data2)

ordinal_accuracy(my_prediction2,model_data2)

## [1] 0.2372263

The model accuracy is 0.24. If an animal has a severe injury, then \(y^*\) will increase by 2.290, making it more likely to get adopted slower than a healthy animal, keeping all other variables constant. We can see that the mean of Dewormed is near zero, therefore, we will exclude Dewormed in our future model. If an animal is in age group 1(4 to 11 months), then \(y^*\) will increase by 0.5, making the animal more likely to get adopted slower than an animal in age group 0, keeping all other variables constant. Similar situations apply to other age groups as well. From this, we can reasonably infer that animals in age group 0 are the most popular. If an animal is a Mixed Breed, then \(y^*\) will increase by 1.1, making them more likely to get adopted slower than an animal that is not mixed breed, keeping all other variables constant.

4.4 Making some adjustments

Our accuracy is less than ideal. One reason behind this could be due to how adoption speed is grouped. Group 0 is being adopted the day of, group 1 is being adopted between 2-7 days, and group 2 is being adopted between 8-30 days. The difference between group 0,1, and 2 can be due to chance. Therefore, I’ve decided to group groups 0,1, and 2 into 1 group. In our future models, we will be predicting for 3 adoption speed groups.

4.5 Model 3

In this model, we will be adding two interaction terms. The figure below shows that the younger the animal is, the less likely it will get sterilized, since the procedure can’t be done when they are young. I have also realized that only dogs can be marked as mixed breed. Therefore, we will add two interaction terms: Type\(*\)MixedBreed and AgeGroup\(*\)Sterilization.

set.seed(454)
mysample2 <- adoption[sample(1:nrow(adoption), 1000,
   replace=FALSE),]

ordinal_accuracy2<-function(post_preds,mydata){
  post_preds<-as.data.frame(post_preds)
  results<-c()
  for (j in (1:length(post_preds))){
    results[j]<-as.numeric(tail(names(sort(table(post_preds[,j]))))[3])
    }
  results<-as.data.frame(results)
  compare<-cbind(results,mydata$AdoptionSpeed_Group)
  compare<-compare %>%mutate(results=as.numeric(results))
  compare<-compare %>% mutate(`mydata$AdoptionSpeed_Group`=as.numeric(`mydata$AdoptionSpeed_Group`))
  compare<-compare %>%mutate(accuracy=ifelse(as.numeric(results)==as.numeric(`mydata$AdoptionSpeed_Group`),1,0))
  print(sum(compare$accuracy)/length(post_preds))
}

4.5.1 Model Building

\[\text{Let }y^*\text{ be the latent variable, }\zeta\text{ be the vector with 2 cutpoints, and }Y_i\text{ be the } \\\text{ ordinal outcome for the ith animal with 3 possible adoption speeds.}\] \[Y_i|\zeta_1,\zeta_2,\beta_1,...,\beta_{17} = \left\{ \begin{array}{ll} 1 & \quad y^* < \zeta_1 \\ 2 & \quad \zeta_1 \leq y^* < \zeta_2 \\ 3 & \quad \zeta_2 \leq y^* \end{array} \right.\] \[ y^*=\beta_1x_1...+\beta_{10}x_{10}+\beta_{11}x_1*x_10+\beta_{12}x_2*x_5+...+\beta_{17}x_2*x_9\] where \(x_n\) for n from 1 to 11 is the indicator for cat, minor/severe injury, sterilization, vaccination, age groups 1-5, and Mix breed.\(\beta_{12}\) is the coefficient for the interaction term of \(Type*MixedBreed\), and \(\beta_{13}\) to \(\beta_{17}\) are the coefficients for the interaction terms of \(AgeGroup*Sterilization\). \[\\\zeta_1 \sim N(m_{01},s_{01}^2) \\... \\\beta_1 \sim N(m_1,s_1^2) \\...\]

model3 <- stan_polr(AdoptionSpeed_Group ~ Type+MixedBreed+Health+Vaccinated+AgeGroup+Sterilized, data=mysample2, prior=R2(0.5,what= NULL),iter=5000, seed = 454)

4.5.2 Posterior Inference

model3_summary<- summary(model3)
head(as.data.frame(model3_summary), -2)

##                  mean         mcse         sd         10%       50%       90%
## Type1       0.3451827 0.0019710823 0.23476422  0.04353292 0.3425949 0.6421678
## MixedBreed1 0.9758830 0.0022277932 0.24084355  0.66773156 0.9752728 1.2843762
## Health1     0.5549085 0.0031939476 0.35711011  0.09364087 0.5573441 1.0099537
## Health2     1.6569184 0.0128234459 1.29691140  0.12056026 1.5538236 3.3182407
## Vaccinated1 0.1700844 0.0012275795 0.16079859 -0.03401840 0.1706597 0.3776179
## AgeGroup1   0.5505604 0.0016178448 0.18017912  0.31706524 0.5493472 0.7840974
## AgeGroup2   0.7259747 0.0019790205 0.23643728  0.42795142 0.7253973 1.0323629
## AgeGroup3   0.9113914 0.0030942282 0.35135824  0.46744848 0.9097726 1.3594050
## AgeGroup4   0.5862925 0.0040405589 0.47104174 -0.01181623 0.5814861 1.1957654
## AgeGroup5   0.3243656 0.0055521388 0.64586317 -0.51509322 0.3412998 1.1392635
## Sterilized1 0.5207101 0.0016653392 0.18600371  0.28016222 0.5200531 0.7599464
## 1|2         1.1190016 0.0022699191 0.24307702  0.80645532 1.1160157 1.4277635
## 2|3         2.1449649 0.0022639660 0.25308847  1.82145025 2.1405737 2.4761848
## mean_PPD:1  0.5203933 0.0002372490 0.02376417  0.49026764 0.5206813 0.5510949
## mean_PPD:2  0.2144968 0.0002032269 0.02032452  0.18856448 0.2141119 0.2408759
##             n_eff      Rhat
## Type1       14186 0.9997899
## MixedBreed1 11687 0.9999682
## Health1     12501 1.0000555
## Health2     10228 1.0002430
## Vaccinated1 17158 0.9996834
## AgeGroup1   12403 1.0002466
## AgeGroup2   14274 0.9998599
## AgeGroup3   12894 0.9997617
## AgeGroup4   13591 0.9999414
## AgeGroup5   13532 0.9999226
## Sterilized1 12475 1.0003604
## 1|2         11467 0.9998890
## 2|3         12497 0.9998815
## mean_PPD:1  10033 1.0001560
## mean_PPD:2  10002 0.9998024

set.seed(454)
model_data3<-mysample2 %>%
  dplyr::select(AdoptionSpeed_Group, Type,Health,Sterilized,Vaccinated,AgeGroup,MixedBreed) %>%
  na.omit()
my_prediction3 <- posterior_predict(
  model3,
  newdata = model_data3)

ordinal_accuracy2(my_prediction3,model_data3)

## [1] 0.5596107

The model accuracy is 0.560, that’s a big improvement! From the model summary, we notice that 0 is about one standard deviation away from the mean of vaccination’s coefficient. Therefore, we will exclude vaccination in our final model.