JAGS as a tool for simulation studies

In one of my last posts I hinted to the fact that JAGS could be used for simulation studies. Let’s have a look! So basically simulation studies are very useful if one wants to check if a model one is developing makes sense. The main aim is to assess, if the parameters one has plugged into a data simulation process based on the model could be reasonably recovered by an estimation process. In JAGS it is possible to simulate data given a model and a set of known parameters and to estimate the parameters given the model and the simulated data set. For today’s example I have chosen a simple psychometric latent variable measurement model from classical test theory, which is very well understood and can be easily estimated with most software capable of latent variable modeling. Here it is:

.

The basic idea is that a latent trait is measured by three items ( to ) with different difficulties ( to , different loadings ( and ) and different measurement error variances ( to ). For identification reasons, the loading of item 3 is set to 1. Furher the variance of the trait () has to be estimated from data.

For simulating data with JAGs we have to specify a data block. The model specification is straight forward, note that we have to hand over the true parameters to JAGS, for example with an R script. This is pretty handy because one could design a thorough simulation study for different sample size and parameter combinations. The estimation part has to be specfied in a model block.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 #Simulation data{   for(i in 1:n) { y1[i]~dnorm(mu1.true[i], tau.y1.true); y2[i]~dnorm(mu2.true[i], tau.y2.true); y3[i]~dnorm(mu3.true[i], tau.y3.true);   mu1.true[i]<-lambda1.true*eta.true[i]+beta1.true mu2.true[i]<-lambda2.true*eta.true[i]+beta2.true mu3.true[i]<-eta.true[i]+beta3.true }   tau.y1.true<-pow(sigma.y1.true, -2); tau.y2.true<-pow(sigma.y2.true, -2); tau.y3.true<-pow(sigma.y3.true, -2);   }   # Estimation model{   for(i in 1:n) { y1[i]~dnorm(mu1[i], tau.y1); y2[i]~dnorm(mu2[i], tau.y2); y3[i]~dnorm(mu3[i], tau.y3);   mu1[i]<-lambda1*eta[i]+beta1 mu2[i]<-lambda2*eta[i]+beta2 mu3[i]<-eta[i]+beta3 }   for(i in 1:n) { eta[i]~dnorm(0,tau.eta); }   # Priors for the loadings lambda1~dnorm(0,1.0E-3) lambda2~dnorm(0,1.0E-3) bias_lambda1<-lambda1-lambda1.true   # Priors for the easiness parameters beta1~dnorm(0,1.0E-3) beta2~dnorm(0,1.0E-3) beta3~dnorm(0,1.0E-3) bias_beta1<-beta1-beta1.true   # Hyperprior for latent trait distribution tau.eta<-pow(sigma.eta,-2) sigma.eta~dunif(0,10)   # Priors for the measurement errors tau.y1<-pow(sigma.y1, -2); tau.y2<-pow(sigma.y2, -2); tau.y3<-pow(sigma.y3, -2); sigma.y1~dunif(0,10); sigma.y2~dunif(0,10); sigma.y3~dunif(0,10); }

This may look a bit daunting, but it really took only 15 minutes to set that up.

The simulation is controlled by an R script, which hands over the true parameters to JAGS:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 library(R2jags) library(mcmcplots)   n=500 # number of subjects   # generate 'true' latent trait values eta.true<-rnorm(n, 0, 4) # generate 'true' factor loadings lambda1.true=1 lambda2.true=0.5   beta1.true=1 beta2.true=2 beta3.true=3   # generate 'true' error standard deviations sigma.y1.true=1 sigma.y2.true=0.5 sigma.y3.true=0.75   data<-list("n", "eta.true", "lambda1.true", "lambda2.true", "sigma.y1.true", "sigma.y2.true", "sigma.y3.true", "beta1.true", "beta2.true", "beta3.true")   parameters<-c("lambda1","lambda2", "sigma.eta", "sigma.y1", "sigma.y2", "sigma.y3", "beta1", "beta2", "beta3", "bias_beta1")   output<-jags.parallel(data, inits= NULL, parameters, model.file="simple_1f.txt",n.chains=2, n.iter=35000, n.burnin=30000) attach.jags(output) denplot(output) traplot(output) plot(output)

If we inspect the posterior densities based on one dataset with sample size 500 we see that we pretty much hit the spot. That is, all posterior densities are covering the generating true parameters.

Please note that in the estimation code a posterior density for the difference between the data generating parameter and its’ samples from the posterior is specified. The zero is very well covered, but keep in mind that this is based only on one dataset. To replicate a more traditional approach to bias, a viable route could be to simulate several hundreds of datasets and to record for example the posterior mean for each parameter and data set to assess bias.

As always, keep in mind that this is pretty experimental code and estimating latent variable models with JAGS is not very well explored yet as far as I know. However, the main point of this post was to show how JAGS could be used in combination with R to run simulation studies in a probabilistic graphical framework. Should work for multilevel models or generalized mixed models too.

Edit: I just stumbled about this recent blog post by Robert Grand. So stan should work too. Interesting!

Pictures!

This blog looks super technical right now and one may get the impression that I am only munching probability distributions day in and day out! No! I am also munching pizza!

Keep in mind that this is my personal blog so I may also post things that interest me apart from statistics. One of those things are pictures!

There are quite some photos by me online but not so many in which I am depicted. So here is a recent one of me making pizza dough at a friends’ place:

Picture by Martin.

Bayesian estimation with JAGS and MPlus: a quick practical comparison

Bayesian estimation techniques were added to Mplus version 7 some time ago. Of course I was interested if MPlus and JAGS come to comparable results when let loose on data. đ

In particular I was interested in the crossclassified module of MPlus, so I wrote some code to simulate data with R and subsequently recovered the parameters with JAGS and MPlus. Recently I found out how JAGS could also be used as a data simulator for simulation studies, but this is a different topic.

So graphically, the model I used for simulation looks like this:

I have omitted the level-1 residual variances and the variances of the latent distributions in the figure for simplicity reasons. The full model specifications can be gleaned from the source code attached to this post.

To make a long story shot: both approaches, JAGs and MPlus, pleasently come to very comparable results with regard to the parameters. At the moment I am too busy with other things to harp on this more, but I thought it is useful to put the code on the blog in case someone is interested to work on a similar topic in more detail. Thanks to Mr Asparouhov for clarifying some questions with regard to MPlus.

The files:
test.R: code to simulate data from the model, recover the parameters with JAGs, also a bit of convergence diagnostics

test.txt: JAGs model file

cc.inp: MPlus input file

cc.dat: Simulated data. Will be overwritten once the simulation part of test.R is executed.

Get the files here. The code is not very polished but should give an idea how to specify the model in MPlus and JAGs.

Blogroll

The blog has been neglected for quite some time. At the moment I am using it more or less as a public notepad, but comments are definately invited! However, I added a blogroll.

The book by Gelman and Hill on multilevel hierarchical models helped me quite a bit with regard to the topic of multilevel/hierarchical models and it is still very useful, especially the part about modeling multivariate normal distributions for random effects with MCMC.

Felix maintains quite an active and interesting blog with topics related to science, statistical methods and R.

John R. Kruschke is known for his book on doing Bayesian data analysis, as far as I know one of the first in the field for psychologists. A book review by Wolf Vanpaemel and Francis Tuelinckx appeared in the Journal of Mathematical Psychology.

R-Bloggers is a central hub of content collected from bloggers who write about R.

Rotational Invariance in factor models II

I have been working a bit more on estimating simple factor models with JAGS. I posted the problem from the last post over there at the open discussion forum and had some very helpful input from Vincent Arel-Bundock from Michigan University and one of his co-authors, Walter R. Mebane which ultimately led to the code of the current post.

Here I wanted to know how to estimate a simple factor model for continuous manifest variables and three latent, intercorrelated factors. The code is simple and straight forward. I put a scaled inverse Wishart prior on the variance-covariance matrix of factor loadings, following the approach in Gelman & Hill (2007, p.376). For solving the invariance problem I set the loadings for one item on each factor to 1. In addition, the means of the latent trait distributions (factor values) were all constrained to zero. This is a very traditional approach, that is also implemented in e.g. MPlus, the STATA program GLLAMM or the R package lavaan.

In the following code I use the Holzinger & Swineford dataset from the lavaan tutorial for comparison. First I estimate the model with lavaan and in a second step with JAGS. Both approaches show very comparable results, but keep in mind that this is somewhat experimental. When comparing coefficients please note that I coded the model in such a way that JAGS returns estimated standard deviations and not variances for the manifest items and latent variables.

Ok, enough talk, here is the code. First the R-script:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 ## holzinger.R # Experimental!   library(lavaan)   # From the lavaan tutorial HS.model <- ' visual =~ x1 + x2 + x3 textual =~ x4 + x5 + x6 speed =~ x7 + x8 + x9 '   fit <- cfa(HS.model, data=HolzingerSwineford1939, meanstructure=TRUE)   summary(fit, standardized=TRUE)   library(R2jags) data<-HolzingerSwineford1939 ## Transform data to long format---- # the complicated way   # Dependent variable y<-c(data$x1, data$x2, data$x3, data$x4, data$x5, data$x6, data$x7, data$x8, data$x9) # Item identifier item<-c(rep(1, 301), rep(2, 301), rep(3, 301), rep(4, 301), rep(5, 301), rep(6, 301), rep(7, 301), rep(8, 301), rep(9, 301)) # Construct identifier (1: visual, 2: textual, 3: speed) const<-c(rep(1,3*301 ), rep(2,3*301 ), rep(3,3*301 )) # Person id id<-rep(seq(1:301),9) ## --------------------------------- # Fitting the model with JAGS W=(diag(3)) dat<-list("y", "item", "const", "id", "W") parameters<-(c("beta", "sigma.y", "lam", "sigma.eta1", "sigma.eta2", "sigma.eta3", "rho_eta13", "rho_eta12", "rho_eta23", "eta")) output<-jags.parallel(dat, inits=NULL, parameters, model.file="3fa.txt",n.chains=2, n.iter=3000, n.burnin=2000) plot(output) Next the model code for JAGS: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 ## 3fa.txt # Experimental! model{ # Likelihood for(i in 1:2709) # Loop over observations { # Linear decomposition could probably done more # elegantly with matrix algebra mu[i]<-beta[item[i]] + lam[const[i], item[i]]*eta[const[i], id[i]]+ lam[const[i], item[i]]*eta[const[i], id[i]]+ lam[const[i], item[i]]*eta[const[i], id[i]] # Distributional assumptions for the # observed variables y[i]~dnorm(mu[i], tau.y[item[i]]) } # Priors for the items for(i in 1:9) { beta[i]~dnorm(0,1.0E-3) tau.y[i] <- pow(sigma.y[i], -2) sigma.y[i] ~ dunif (0, 100) } # Matrix of loadings [construct, item] # Constraints and priors lam[1,1]<-1 lam[2,1]<-0 lam[3,1]<-0 lam[1,2]~dnorm(0,1.0E-3) lam[2,2]<-0 lam[3,2]<-0 lam[1,3]~dnorm(0,1.0E-3) lam[2,3]<-0 lam[3,3]<-0 lam[1,4]<-0 lam[2,4]<-1 lam[3,4]<-0 lam[1,5]<-0 lam[2,5]~dnorm(0,1.0E-3) lam[3,5]<-0 lam[1,6]<-0 lam[2,6]~dnorm(0,1.0E-3) lam[3,6]<-0 lam[1,7]<-0 lam[2,7]<-0 lam[3,7]<-1 lam[1,8]<-0 lam[2,8]<-0 lam[3,8]~dnorm(0,1.0E-3) lam[1,9]<-0 lam[2,9]<-0 lam[3,9]~dnorm(0,1.0E-3) # Scaled inverse Wishart prior for # variance-covariance matrix of factor values # after Gelman and Hill (2007) for(i in 1:301) { eta[1,i]<-xi.eta1*eta.raw[i,1] eta[2,i]<-xi.eta2*eta.raw[i,2] eta[3,i]<-xi.eta3*eta.raw[i,3] eta.raw[i,1:3]~dmnorm(eta.raw.hat[i,],tau.eta.raw[,]) eta.raw.hat[i,1]<-mu.eta1.raw eta.raw.hat[i,2]<-mu.eta2.raw eta.raw.hat[i,3]<-mu.eta3.raw } mu.eta1<-xi.eta1*mu.eta1.raw mu.eta2<-xi.eta2*mu.eta2.raw mu.eta3<-xi.eta3*mu.eta3.raw mu.eta1.raw<-0 mu.eta2.raw<-0 mu.eta3.raw<-0 xi.eta1~dunif(0,100) xi.eta2~dunif(0,100) xi.eta3~dunif(0,100) tau.eta.raw[1:3,1:3] ~ dwish(W[,],df) df<-4 sigma.eta.raw[1:3,1:3]<-inverse(tau.eta.raw[,]) sigma.eta1<-xi.eta1*sqrt(sigma.eta.raw[1,1]) sigma.eta2<-xi.eta2*sqrt(sigma.eta.raw[2,2]) sigma.eta3<-xi.eta3*sqrt(sigma.eta.raw[3,3]) rho_eta12<-sigma.eta.raw[1,2]/sqrt(sigma.eta.raw[1,1]*sigma.eta.raw[2,2]) rho_eta13<-sigma.eta.raw[1,3]/sqrt(sigma.eta.raw[1,1]*sigma.eta.raw[3,3]) rho_eta23<-sigma.eta.raw[2,3]/sqrt(sigma.eta.raw[2,2]*sigma.eta.raw[3,3]) } Rotational invariance in factor models Long time no post. Meanwhile I got my Ph.D. and I am still MCMC’ing. In particular I am interested in estimating latent variable models via MCMC with possible extensions to multilevel latent variable models. To get a grip on it I started with defining a simple one-factorial latent variable model and stumbled over the phenomenon of rotational invariance which is described in quite some detail in this techical report by Erosheva and McKay Curtis. In short, the likelihood surface of factor models is multimodal which makes it difficult to estimate this type of model with Monte Carlo methods. If one is using multiple chains they may operate on different modes of the likelihood surface. This phenomenon is also known as pole switching or label switching in latent class analysis or mixture modeling. Erosheva and McKay Curtis offer a solution based on postprocessing the markov chains. I am wondering if this is feasible, if the models become more complicated. If you want to have a more hands on experience with regard to the problem, have a look at this JAGS model. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 model{ # Likelihood # i: Persons # k: Items for(i in 1:n) { for(j in 1:k) { mu[i,j]<-beta[j]+lambda[j]*eta[i] y[i,j]~dnorm(mu[i,j], tau.y[j]) } } # Priors for(j in 1:k) { # Item-Residuals tau.y[j] <- pow(sigma.y[j], -2) sigma.y[j] ~ dunif (0, 100) # Intercepts beta[j]~ dnorm(0,1.0E-3) # Loadings lambda[j]~dnorm(0,1.0E-3) } # Factor-Values for(i in 1:n) { # Variance of latent score distribution fixed to 1 eta[i]~dnorm(0,1) } } In Psychometrics this model is also known as tau congeneric measurement model for continuous observed variables. Here I have fixed the variance of the latent trait distribution to 1 for identifiability reasons. The following R-Script simply generates some data from the model with know parameters and hands the data over to JAGS. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 library(R2jags) # Number of Persons n=100 # Number of Items k=10 # Difficulties beta.gen=rnorm(k, mean=0, sd=1) # Abilities a<-(rnorm(n, 0, sd=1)) eta.gen<-a-mean(a) # Error standard deviations sigma.gen=runif(k)*3 y<-matrix(rep(0,n*k), nrow=n, ncol=k) # Produce data for(i in 1:n) { for(j in 1:k) { y[i,j]=rnorm(1,mean=beta.gen[j]+eta.gen[i], sd=sigma.gen[j]) } } data<-list("y", "n", "k") parameters<-c("beta", "sigma.y", "lambda", "eta") # Starting values inits <- function(){ list(beta = rnorm(k), lambda=c(rnorm(k)), eta=c(rnorm(n)), sigma.y=runif(k)*100) } output<-jags(data, inits=NULL, parameters, model.file="1f_lat_var.txt",n.chains=2, n.iter=5000, n.burnin=2500) attach.jags(output) plot(output) The main interest here is to check, if the data generating parameters could be reasonably recovered. If you run the code – and if you are lucky – the two employed markov chains will converge to one solution. But it may well be that the two chains will end up operating in two different modes of the likelihood surface. In this case you will see a reflection of the two chains for the loadings and the factor values around the y-axis in the plots. This does not mean that the model is ‘wrong’ or something like that. It simply shows that there are two equivalent solutions. In practice one could simply choose one chain and be happy with that, but if one wants to extend the model by introducing more latent variables or even multilevel structures, the multimodality could be problematic when using MCMC-methods. So, if anyone has a simple solution for dealing with this phenomenon, let me know! Meanwhile I am having a look at pymc. Some short thoughts on MCMC methods While MCMC is certainly a useful method and has the potential to adress several difficulties that may exist with more traditional methods of parameter estimation, in my opinion it is not a magic bullet for several reasons: 1. The estimation procedure can be slow, especially for larger datasets. 2. The use of prior distributions may deter some researchers. 3. Sometimes it is not clear which prior distribution is appropriate. 4. Model specification in code requires a thorough understanding of the model in question from a formal point of view. On the other hand, the potential advantages are: 1. Unified handling of a wide scope of models within the framework of graphical models. 2. Unified method for checking a model (posterior predictive checks). 3. Possibility of obtaining posterior distributions of derived statistics, such as effect size indices. 4. Model based handling of missing values and possibility of obtaining posterior distributions of missings in the dependent variable. 5. Possibility of rapid model prototyping. For building confidence it may be useful to apply the MCMC method to the wide range of models used in the social and behavioral sciences, to cross check against the traditional criteria of parameter estimation (bias, consistency, efficiency) by simulation studies and to examine the influence of the prior distributions on the posterior distributions. But I guess this has already been done. I should check the literature. The Partial Credit Model as PGM The last post’s topic about multinomial logistic regression leads very nicely to a wide range of psychometric models for categorial responses, such as the Partial Credit Model (PCM) by Masters (1982). The PCM is a psychometric model that estimates the probability of a person’s response to an item based on characteristics of the item and the person’s latent ability. One particular nice feature of this model is the fact that Andrich’s rating scale model (RSM) and the dichotomous Rasch model (RM) are – as e.g. Mair and Hatzinger (2007) demonstrated – special cases of the PCM. It is fair to say that Samejima seems to have worked on a similar structure back in the sixties. The focus of this post is on specifying the PCM as a graphical model, estimating the parameters via MCMC and crosschecking the results against the eRm package by Mair and Hatzinger (2007). Let’s have a look at the formal structure of the PCM first. Starting point is the base distribution of multinomial logistic regression, sometimes called the “softmax activation function” in the machine learning literature: (1) The next step is to linearily decompose the parameters with respect to the PCM. For simplicity I will consider the case of four categories per item only. Generalization to categories is simple and straight forward. (2) So, is the ability of person on the latent trait scale, and is the so called threshold parameter of item for the categories and . For example, a threshold parameter means that for item 1, given a latent ability of , the probability of choosing the category scored with 0 is equal to the probability of choosing the category scored with 1. This can be readily verified by for example plotting the item response functions for a set of threshold parameters. To estimate the PCM as a PGM, it is useful to introduce a multilevel perspective with respect to the persons’ abilities: We assume that the latent abilities are normally distributed with a mean of zero and a variance of , which can be estimated from data. This is actually quite nice, because approaches exist to estimate the separability or reliability based on the variance of the latent trait distribution. Please note that the parametrization of the PCM above is a bit different from the original specification by Masters (1982). The Bugs-Code for the model is straight forward: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 # PCM.txt model{ for (i in 1:n) { lambda[1,i]<-0 lambda[2,i]<-1*eta[id[i]]-tau[item[i],1] lambda[3,i]<-2*eta[id[i]]-tau[item[i],1]-tau[item[i],2] lambda[4,i]<-3*eta[id[i]]-tau[item[i],1]-tau[item[i],2]-tau[item[i],3] Z[i]<-exp(lambda[1,i])+exp(lambda[2,i])+exp(lambda[3,i])+exp(lambda[4,i]) for(c in 1:4) { p[i,c]<-exp(lambda[c,i])/Z[i] } y[i]~dcat(p[i,1:4]) } # Priors for(j in 1:k) { tau[j,1]~dnorm(0,1.0E-6) tau[j,2]~dnorm(0,1.0E-6) tau[j,3]~dnorm(0,1.0E-6) } for(i in 1:N) { eta[i]~dnorm(0,tau.eta) } tau.eta<-pow(sigma.eta,-2) sigma.eta~dunif(0,2) } Please note that the data have to be submitted in long format. Now, let’s do something interesting with this. The following R-script uses the rsmdat  example dataset in the eRm-package, transforms it into long format and submits it to JAGS. JAGS generates the posterior distributions the Bayesian way and returns the output to R. Subsequently, the threshold- and ability-parameters are estimated with the pcm()-function in the eRm-package and the results are compared graphically. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 # PCM.R library(R2jags) library(eRm) # Load rsmdat data<-rsmdat # Number of thresholds per item t<-max(data) # Number of persons N<-dim(data)[1] # Number of items k<-dim(data)[2] # Transform data into long format index=1 id<-c(0) item<-c(0) y<-c(0) for(i in 1:N) { for(j in 1:k) { y[index]=data[i,j]+1 id[index]<-i item[index]<-j index<-index+1 } } n<-length(y) # Prepare arguments for jags() data<-list("y","n","N","id", "item", "k") parameters<-c("tau","eta", "sigma.eta") inits <- function(){ list(tau = array(rnorm(t*k),c(k,t)), eta=rnorm(N), sigma.eta=runif(1)*2) } # Fit model output<-jags(data, inits=inits, parameters, model.file="PCM.txt", n.chains=2, n.iter=4000, n.burnin=2000) attach.jags(output) # Checking against eRm ######## # Fit model with eRm out<-PCM(rsmdat) # Collect posterior means of threshold parameters # from jags output t1<-apply(tau[,1,],2,mean) t2<-apply(tau[,2,],2,mean) t3<-apply(tau[,3,],2,mean) t4<-apply(tau[,4,],2,mean) t5<-apply(tau[,5,],2,mean) t6<-apply(tau[,6,],2,mean) thresholds_pgm<-c(t1,t2,t3,t4,t5,t6) # Collect threshold parameters from eRm thresholds_eRm<-thresholds(out)$threshpar # Plot PGM thresholds against eRm thresholds plot(thresholds_eRm, ylim=c(-2,2.5), col=c("red"), ylab="tau") points(thresholds_pgm, ylim=c(-2,2.5), col=c("green"))   # Collect posterior means of # ability parameters from jags eta_pgm<-apply(eta,2,mean) # Collect ability parameters # from eRm eta_eRm<-person.parameter(out)$theta.table[,1] plot(apply(rsmdat,1,mean),eta_eRm, xlim=c(0,3), col=c("red"), xlab="mean score",ylab="eta") points(apply(rsmdat,1,mean),eta_pgm, xlim=c(0,3), col=c("green")) I will spare you the outputs of JAGS and eRm but encourage you to download the scripts and to have a look yourself. Lets jump directly to the plots. Fig. 1.: Threshold parameters estimated by eRm (red) and JAGS (green) Visual inspection shows that eRm and JAGS essentially agree with regard to the locations of the threshold parameters. But there seems to be a difference between JAGS and eRm that is more pronounced for higher threshold parameters. The reason for this is probably the fact that JAGS samples from the posterior distributions of the parameters while eRm is using the CML-method for threshold estimation. Let’s have a look at the ability estimates. Fig. 2.: Ability estimates obtained by eRm (red) and JAGS (green) vs. mean score In Figure 2, the differences between eRm and JAGS seem to be more pronounced. While the relationsship between the mean scores and the ability estimates seems to be more or less linear for the estimates produced by JAGS, this is not the case for eRm. The reason for this is probably the fact that the ability estimates generated by eRm were produced by using the ML-method with CML-estimates for the thresholds plugged in. Warm’s weighted ML method of person parameter estimation will most likely be included into future versions of eRm. Which estimation method is better, Bayesian estimation with JAGS or CML combined with ML as implemented in eRm? It is hard to say from this comparison, a more systematic and thorough simulation study would be due that can be easily constructed from the code above. But note that MCMC-methods can be slow, especially for larger datasets. The point of this post was mainly to demonstrate, how psychometric models could be fit to data with MCMC and to cross-check the results against eRm. The scripts can be obtained here. Multinomial Logit Regression as PGM I decided to switch to English as I believe that this topic may be interesting for a wider audience. In the last post I have given a simple example of how the t-test for independent samples can be understood from a generalized linear modeling (GLM) perspective, how the model translates into a probabilistic graphcal model (PGM) and how to use Bayesian estimation techniques for parameter estimation. The main motivation behind that post was to show, how generalized linear models, probabilistic graphical models and bayesian estimation techniques can work together for solving statistical inference tasks, particularly in but not limited to Psychology. I can hear the one or the other say: “Pah, t-test, peanuts!”. So in this post I would like to turn to something a bit more advanced: Multinomial Logit Regression (MLR). One task of MLR is to model the probability of a categorical response out of a set of mutually exclusive possible responses based on a set of predictors that may characterize for example the repondents. Typical questions are: Which characteristics of a person predict the choice of a specific party (Political Science) or a specific product (Economics) or a specific category on a multiple choice questionaire (Psychology). This type of model is also quite prominent in the machine learning community. One task is to classify text documents into mutually exclusive categories based on a set of text features as predictors. Before we can specify the model as a PGM, we should have a look at its formal structure. First we need a probability distribution that is capable of modeling the probabilty of a certain categorical response based on a set of parametetrs. One such distribution is: (1) This is quite neat. The denominator of that distribution is sometimes called the partition function or in German Zustandssumme from its use in statistical physics. It ensures that the probabilites for the possible events, states or responses sum up to one. Please note that the distribiution above has a little identification problem that can be solved by either setting to zero or by imposing a sum constraint over the parameters. Here we set to zero: (2) This also means that we are using category 1 as a reference category for inference. More on that later. The trick in MLR is to simply decompose the parameters of the base distribution by simple linear regression equations: (3) That’s basically it. The -weights have to be estimated from data, either by maximum likelihood or MCMC. To clarify the interpretation of the parameters and to elaborate a bit more, let’s assume we have three possible categories and only one predictor . The probabilities of choosing a certain category are: (4) with partiton function Please note that runs over all three possible states of the system. If we write down the log-probability-ratio (or log-risk-ratio) of choosing category 2 over category 1 we get: (5) From this equation the meanings of the parameters become clear. is nothing but the log-probability-ratio of choosing category 2 over category 1, if is zero. is the expected change in terms of log-probabilty-ratios per unit increase of . The log-probability-ratio of choosing category 3 over category 1 is: (6) Here we are predicting log-probability-ratios of choosing category 3 over category 1 based on the predictor . is the log-probability-ratio of choosing category 3 over category 1, if is zero and is the expected change in terms of log-probability-ratios per unit increase of . Quite easy and very similar to multiple regression. The only thing one has to keep in mind is that we are predicting log-probability-ratios instead of changes in one continuous variable. We are now in the position of having a graphical look at this simple model. Fig. 1: Graphical representation of the simple multinomial logit regression model in the example The variable y[i] is categorically distributed with parameters l1[i], l2[i] and l3[i]. l1[i] is set to zero for indentification purposes and l2[i] and l3[i] are linearily decomposed, e.g. due to the underlying design, if we are working experimentally. x0[i] is a dummy variable for the constants ( and ) of the regression equations and x1[i] is the predictor. The arrows symbolize dependencies between the nodes of the network. To show how the model’s parameters of are estimated via MCMC in a Bayesian fashion, let’s introduce a little example. Let’s assume we have sampled individuals that are falling into two groups (). Each individual had the choice between 3 alternatives (). We are interested in the question: Do the groups differ with respect to the chosen alternatives? Or: Does the chosen category depend on group membership? If we crosstabulate the data in R we find: 1 2 3 4 5 6 > table(y,x) x y 0 1 1 5 3 2 13 4 3 2 14 By visual inspection we see that group 0 seems to have preference for alternative 2 and group 1 seems to have a preference for alternative 3. To model these data and to estimate the parameters via MCMC we have to translate the model into OpenBUGS code. This is straight forward from the graphical representation or the equations: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 #uo_multinom_reg.txt model { # Likelihood for(i in 1:n) { lambda[1,i]<-0 lambda[2,i]<-beta_20+beta_21*x[i] lambda[3,i]<-beta_30+beta_31*x[i] # Partition function Z[i]<-exp(lambda[1,i])+exp(lambda[2,i])+exp(lambda[3,i]) for(c in 1:3) { p[i,c]<-exp(lambda[c,i])/Z[i] } y[i]~dcat(p[i,1:3]) } # Priors beta_20~dnorm(0,1.0E-6) beta_21~dnorm(0,1.0E-6) beta_30~dnorm(0,1.0E-6) beta_31~dnorm(0,1.0E-6) } The choice of normal priors for the parameters is motivated by the theoretical result that maximum likelihood etimates are asymptotically normally distributed. In addition, the precision of the priors is set to a very low value. So these distributions encode the prior information that we really do not know anything about the distributions of the parameters in the population yet and want to let the data speak for themselves as much as possible. The R package R2OpenBUGS is used to control the OpenBUGS script: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 # uo_multinom_reg.R library(R2OpenBUGS) data<-read.table("uo_multinom_reg.csv", sep=",", head=TRUE) # data preparation y<-data$wahl x<-data$geschl n<-length(y) data<-list("y","x", "n") parameters<-c("beta_20", "beta_30", "beta_21", "beta_31") # Initial values for the markov chains inits <- function(){ list(beta_20 = rnorm(1), beta_21=rnorm(1), beta_30=rnorm(1), beta_31=rnorm(1)) } # Handing over to OpenBUGS output<-bugs(data, inits=inits, parameters, model.file="uo_multinom_reg.txt", n.chains=2, n.iter=8000, n.burnin=1000) # Printing the results print(output, digits=2) The output of OpenBUGS is Inference for Bugs model at "uo_multinom_reg.txt", Current: 2 chains, each with 11000 iterations (first 1000 discarded) Cumulative: n.sims = 20000 iterations saved mean sd 2.5% 25% 50% 75% 97.5% Rhat n.eff beta_20 1.01 0.55 -0.02 0.64 0.98 1.36 2.13 1 2200 beta_30 -1.09 0.94 -3.22 -1.67 -1.01 -0.44 0.49 1 2600 beta_21 -0.68 1.01 -2.66 -1.35 -0.69 -0.01 1.31 1 1400 beta_31 2.78 1.16 0.76 1.98 2.68 3.48 5.35 1 2100 deviance 74.84 3.01 71.11 72.61 74.17 76.25 82.69 1 20000 For each parameter, n.eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor (at convergence, Rhat=1). DIC info (using the rule, pD = Dbar-Dhat) pD = 4.1 and DIC = 79.0 DIC is an estimate of expected predictive error (lower deviance is better).  The statistics indicate that the markov process seems to have reached a stationary distribution. The parameter beta_20 indicates a preference of category 2 over category 1 for group 0. But this difference is statistically not significant, probably due to small sample size. beta_30 indicates a preference of category 1 over category 3 for group 0, but from the percentiles of the posterior distributions we can see that this difference is also not significant statistically. beta_21 is the contrast of group 1 to beta_20. As we can see, group 1 seems to have a slightly lower preference for category 2 over category 1 than group 0 in terms of log-probability-ratios, but this contrast is also not significant statistically. beta_31 is statistically significant. Group 1 has a much stronger preference for category 3 over category 1 than group 0 in terms of log-probabilty-ratios. Please note that this setup can be extended to include a larger set of continuous and categorical predictors. Also a multilevel extension of the model seems to be straight forward by “simply” including random effects into the linear regression equations. The main objective of this post was to show in a simple way, how categorical data analysis techniques translate into a graphical representation that in terms can be used to estimate model parameters with MCMC-methods. Grab the code here. Der t-Test fĂŒr unabhĂ€ngige Stichproben als GLM und PGM The proof of the pudding is in the eating. Von daher mĂ¶chte ich an einem kleinen Beispiel zeigen, wie Generalisierte Lineare Modelle (GLMs), Probabilistische Graphische Modelle (PGMs) und Bayesianische Methoden praktisch zusammenspielen kĂ¶nnen. Was bietet sich besseres an, als der t-Test fĂŒr unabhĂ€ngige Stichproben? Um eine gewisse Vergleichbarkeit zu gewĂ€hrleisten, beziehe ich mich auf das Beispiel zum t-Test in Bortz, Statistik, 6. Auflage (Tabelle 5.1., Seite 142). Zur praktischen Demonstration verwende ich die Software R und OpenBUGS. Die verwendeten Dateien finden sich hier. Der t-Test fĂŒr unabhĂ€ngige Stichproben als GLM Den meisten dĂŒrfte bekannt sein, dass die Fragestellung hinter dem t-Test sich auch mit der multiplen Regression, bzw. dem Allgemeinen Linearen Modell (ALM) bearbeiten lĂ€sst. Im Beispieldatensatz aus dem Bortz geht es darum zu prĂŒfen, ob sich MĂ€nner und Frauen hinsichlich ihrer Belastbarkeit unterscheiden. Mit R geht das ganz einfach mit der Funktion glm(). 1 2 3 4 5 6 7 8 # t-test_ua.R # Laden der Daten data<-read.table("B_5_1.csv", sep=",", head=T) # Anpassung eines GLM mit IdentitĂ€ts-Link-Funktion m1<-glm(belast~geschl, data=data, family="gaussian") summary(m1) Es wird ein GLM mit IdentitĂ€ts-Link-Funktion an die Daten angepasst. R liefert folgenden Output: Call: glm(formula = belast ~ geschl, family = "gaussian", data = data) Deviance Residuals: Min 1Q Median 3Q Max -18.242 -9.732 -2.221 9.289 25.800 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 103.200 2.130 48.452 <2e-16 *** geschlw 1.042 3.057 0.341 0.734 --- Signif. codes: 0 â***â 0.001 â**â 0.01 â*â 0.05 â.â 0.1 â â 1 (Dispersion parameter for gaussian family taken to be 158.7827) Null deviance: 10498 on 67 degrees of freedom Residual deviance: 10480 on 66 degrees of freedom AIC: 541.54 Number of Fisher Scoring iterations: 2  Der mit Intercept bezeichnete SchĂ€tzer ist der vorhergesagte Mittelwert der Belastbarkeit der MĂ€nner und der SchĂ€tzer geschlw ist der Kontrast der Frauen zu dieser Baseline. Das bedeutet, dass Frauen einen um 1.042 Punkte hĂ¶heren vorhergesagten Wert auf der Belastbarkeits-Skala aufweisen. Allerdings ist anhand des t-Werts von 0.341 deutlich ersichtlich, dass dieser Unterschied statistisch nicht signifikant ist. Dieses Ergebnis deckt sich im Wesentlichen mit dem Ergebnis im Bortz, wo ein t-Wert von -0.33 verzeichnet ist. Der t-Test fĂŒr unabhĂ€ngige Stichproben als PGM Das eben verwendete GLM lĂ€sst sich auch als grafisches Modell darstellen. Abbildung 1 zeigt die grafische ReprĂ€sentation. Abbildung 1: Das GLM fĂŒr den t-Test fĂŒr unabhĂ€ngige Stichproben als PGM y[i] bezeichnet die Werte der abhĂ€ngigen Variable und x1[i] und x2[i] bezeichnen die Werte der jeweiligen PrĂ€diktoren in der Design-Matrix. mu[i] ist der lineare PrĂ€diktor des Modells, der nach MaĂgabe des zugrundeliegenden Designs linear zerlegt wird. sigma.y ist die Streuung der Residuen in der Population. Die Pfeile bezeichnen AbĂ€ngigkeiten zwischen den Knoten des Netzwerks. Der Rahmen (plate notation) um die Knoten symbolisiert, dass Beobachtungen vorliegen. Formal lĂ€sst sich das Modell relativ einfach darstellen: Es wird also angenommen, dass die beobachteten Werte einer Normalverteilung mit der Mittelwertsstruktur und der Varianz entstammen. Oder anders gesagt, es wird angenommen, dass die Residuen in der Population mit der Varianz normal verteilt sind. Wenn die Dummy-Codierung fĂŒr der Spezifikation einer Konstanten entspricht, handelt es sich also um nichts anderes, als um eine einfache lineare Regression in grafischer Darstellung. Bayesianische Bestimmung der Parameter Die Parameter , und die Varianz mĂŒssen aus den Daten geschĂ€tzt werden. Wie dies auf Bayesianische Art mit Hilfe der MCMC-Methode, R und OpenBUGS geschehen kann, sei kurz gezeigt. ZunĂ€chst muss eine Datei mit der Modellspezifikation fĂŒr OpenBUGS erstellt und gespeichert werden. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 # t-test_ua.txt model { # Likelihood for (i in 1:n) { y[i] ~ dnorm (mu[i], tau.y) mu[i]<-beta0*X[i,1]+beta1*X[i,2] } # Prior-Distributions beta0 ~ dnorm(0, 1.0E-6) beta1 ~ dnorm(0, 1.0E-6) tau.y <- pow(sigma.y, -2) sigma.y ~ dunif (0, 1000) # Cohen's delta delta<-(beta1)/sigma.y } Die Spezifikation des Modells erfolgt im Abschnitt #Likelihood. AnschlieĂend werden die Prior-Verteilungen spezifiziert, die hier relativ uninformativ gehalten sind. Als besonderer Bonus wird die Posterior-Verteilung von Cohen's delta auf Basis des MCMC-Prozesses erzeugt. OpenBUGS wird durch das R-Skript angesteuert, das auch die Daten an OpenBUGS ĂŒbergibt. 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 library(R2OpenBUGS) # Vorbereitung der Daten zur # Ăbergabe an OpenBUGS y<-data$belast X<-model.matrix(m1) n<-length(y) data<-list("y","X", "n")   # Spezifikation der zu bestimmenden Parameter parameters<-c("beta0", "beta1", "sigma.y", "delta")   # Ăbergabe an OpenBUGS output<-bugs(data, inits=NULL, parameters, model.file="t-test_ua.txt", n.chains=2, n.iter=5000, n.burnin=1000)   # Ausgabe der Ergebnisse print(output, digits=2)   # Histogramm von beta1 hist(beta1,100) # Histogramm von Cohen's delta hist(delta,100)

Um die Konvergenz des Markov-Prozesses zu ĂŒberprĂŒfen, kommen zwei Markov-Ketten zum Einsatz. Der Prozess durchlĂ€uft 5000 Iterationen, wobei die Burn-In-Phase 1000 Iterationen betrĂ€gt. Die Startwerte des Markov-Prozesses werden im Beispiel zufĂ€llig erzeugt.

OpenBUGS liefert folgenden Output:

Inference for Bugs model at "t-test_ua.txt",
Current: 2 chains, each with 5000 iterations (first 1000 discarded)
Cumulative: n.sims = 8000 iterations saved
mean   sd   2.5%    25%    50%    75%  97.5% Rhat n.eff
beta0       103.25 2.14  99.06 101.80 103.20 104.70 107.40    1  8000
beta1         0.99 3.07  -5.06  -1.02   0.97   2.98   7.06    1  8000
sigma.y      12.84 1.15  10.82  12.03  12.75  13.55  15.35    1  8000
delta         0.08 0.24  -0.39  -0.08   0.08   0.23   0.55    1  8000
deviance    538.62 2.55 535.80 536.80 538.00 539.70 545.00    1  4300

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = Dbar-Dhat)
pD = 2.9 and DIC = 541.5
DIC is an estimate of expected predictive error (lower deviance is better).

Von Interesse sind hier besonders die Mittelwerte und Perzentile der Posterior-Verteilungen und Rhat. Der SchĂ€tzer fĂŒr die mittlere Belastbarkeit der MĂ€nner betrĂ€gt 103.25 und der geschĂ€tzte Kontrast der Frauen zum Mittelwert der MĂ€nner betrĂ€gt 0.99. Das 95%-KredibilitĂ€ts-Intervall fĂŒr liegt zwischen -5.06 und 7.06. Damit ist der Unterschied zwischen MĂ€nnern und Frauen statistisch nicht signifikant. Der SchĂ€tzer fĂŒr Cohen's delta betrĂ€gt 0.08. Dieses Ergebnis ist mit dem Resultat im Bortz kongruent. Zudem wurde das 95%-KredibilitĂ€ts-Intervall fĂŒr Cohen's delta ermittelt. Dieses Intervall liegt zwischen -0.39 und 0.55, was ebenfalls darauf hindeutet, dass der Unterschied zwischen MĂ€nnern und Frauen hinsichtlich der Belastbarkeit statistisch nicht signifikant ist. Die Statistik Rhat zeigt an, dass die zwei eingesetzten Markov-Ketten gegen die Posterior-Verteilung der Parameter konvergieren.

Was auffĂ€llt ist, dass in der Ausgabe keine p-Werte vorkommen. Das ist nicht weiter tragisch, da die Inferenz auf Basis der KredibilitĂ€ts-Intervalle geschieht. Wer aber dennoch nicht auf p-Werte verzichten mĂ¶chte, kann diese einfach aus den Posterior-Verteilungen berechnen. Zum Abschluss sei hier noch das Histogramm der Posterior-Verteilung von Cohen's delta dargestellt.

Abbildung 2: Posterior-Verteilung von Cohen's delta

Das war zugegebener MaĂen ziemlich viel Information auf einmal und Anspruch dieser Blog-Post kann es nicht sein, die HintergrĂŒnde der hier verwendeten Verfahren umfassend zu erklĂ€ren. Den interessierten Leser oder die interessierte Leserin verweise ich auf die Literatur. Mir war es jedoch wichtig, einmal an einem einfachen Beispiel zu zeigen, wie GLMs, PGMs und Bayesianische Methoden ineinandergreifen kĂ¶nnen.

Als Ăbung empfehle ich zu ĂŒberlegen, wie das grafische Modell und der OpenBUGS-Code im Falle der multiplen Regression aussehen mĂŒssten. Bonuspunkte gibt es fĂŒr die Erzeugung der Posterior-Verteilung fĂŒr den Determinations-Koeffizienten, bzw. . Interessant wĂ€re auch ein systematischer Vergleich der hier dargestellten Methode zur Erzeugung von KredibilitĂ€ts-Intervallen fĂŒr EffektstĂ€rke-Indices mit der âfrequentistischen" Methode zur Erzeugung von Konfidenz-Intervallen auf Basis von nichtzentralen Verteilungen.