Structural Equation Modelling (SEM) and Multi-group SEM using R

Structural Equation Modeling (SEM) is a multivariate statistical analysis technique that is used to analyze structural relationships among variables. SEM is the combination of factor analysis and multiple regression analysis. Usually factors are created using multiple observed variables through factor analysis. Those factors are called latent variables. Thereafter, multiple regression analysis is performed on latent variables level, not in observed variables level. Let’s assume that you have proper theoretical knowledge about Structural Equation Modeling. If not, it is suggested that you read the books- “Ullman, J.B. and Bentler, P.M., 2003. Structural equation modeling. John Wiley & Sons, Inc.” and “Bollen, K.A. and Curran, P.J., 2006. Latent curve models: A structural equation perspective (Vol. 467). John Wiley & Sons.”

Common softwares for conducting SEM analysis are MPLUS, AMOS, SmartPLS, STATA and R. All the mentioned softwares come with a price but R. R is a free statistical analysis tool and here the codes of doing SEM and multi-group SEM using the ‘lavaan’ package are presented. For detail you may read “Rosseel, Y., 2011. lavaan: an R package for structural equation modeling and more.”

 

##to load data in R

data<- read.csv(“F:/1. STUDY /WORK in Progress/working data/version7/worldbankdataset16.csv”)

summary(data)

 

#select observations of required variables without missing values #’na.omit’ for dropping variables with missing values #subset for creating subset of the main datafile

portmodel=na.omit(subset(data, select=c(LPIAT12, LPICQ12, LPIEA12, LPIEC12, LPIFS12, LPIQT12, CT12,LC12, QPI12, PPGDPc12, UNDC)))

summary (portmodel)

 

#required packages for descriptive stats, normality tests & SEM

library(psych)

library(MVN)

library(lavaan)

library(semPlot)

library(xlsx)

 

#Descriptive stats

summary(portmodel) #all at once but limited

describe(portmodel$QPI12)

describe(portmodel$LPIAT12)

describe(portmodel$LPICQ12)

describe(portmodel$LPIEA12)

describe(portmodel$LPIEC12)

describe(portmodel$LPIFS12)

describe(portmodel$LPIQT12)

describe(portmodel$CT12)

describe(portmodel$LC12)

describe(portmodel$PPGDPc12)

 

#normality tests

install.packages(“MVN”) #to install MVN package

library(MVN) #to use MVN package

uniPlot(subset, type=”qqplot”) #creates univariate Q-Q plots

uniPlot(subset, type=”histogram”) #creates univariate histograms

 

 

#Linear Regression just to check different relationships

fit = lm(PPGDPc12 ~  CT12+LC12+LPIAT12, data=portmodel)

fit = lm(PPGDPc12 ~ LPIAT12 + CT12+LC12+QPI12+ GERS12 + UP12, data=portmodel)

summary(fit) # show results

corr.test(portmodel$CT12, portmodel$LPIAT12)

 

#factor reliability check

LPI <- subset(portmodel, select=c(LPIAT12, LPICQ12, LPIEA12, LPIEC12,LPIFS12,LPIQT12))

alpha(LPI) #0.97

 

ST<-subset(portmodel, select=c(CT12, LC12))

alpha(ST) #0.86

 

##Confirmatory factor analysis (CFA) model

cfamodel1 <- ‘QPI=~ QPI12

LPI =~ LPIAT12 + LPICQ12 + LPIEA12  + LPIEC12+LPIFS12+LPIQT12

POUT =~ CT12 +LC12

ECON=~PPGDPc12’

#unfortunately I have two latent variables (QPI and ECON) with only one observed variable. Always try to have multiple observed variables.

 

#CFA results

fit.cfamodel1<- cfa(cfamodel1, data=portmodel,  estimator=”MLM”, mimic= “Mplus”,std.lv=T)

#MLM estimation used as data are not normally distributed, if normal use “ML”

summary(fit.cfamodel1,  fit.measures=TRUE, standardized=TRUE, rsq=TRUE)

fitMeasures(fit.cfamodel1)

#reliability check

reliability(fit.cfamodel1)

 

inspect(fit.cfamodel1,”cov.lv”)

inspect(fit.cfamodel1,”cor.lv”)

inspect(fit.cfamodel1,”theta”)

 

 

#Structural Equation Model (SEM)

semmodel1 <- ‘LPI =~ LPIAT12 + LPICQ12 + LPIEA12 + LPIEC12 + LPIFS12+ LPIQT12

POUT =~  CT12 +LC12

QPI=~QPI12

PPGDPc=~PPGDPc12

 

LPI~a*QPI

POUT~b*LPI+c*QPI

PPGDPc~d*POUT+e*QPI+f*LPI

 

#indirect effect

ab:= a*b

bd:= b*d

cd:= c*d

abd:=a*b*d

LPIEC12 ~~  LPIQT12’  #correlation between error terms

 

#SEM  results and analysis

fit.semmodel1<- sem(semmodel1, data=portmodel, estimator=”MLM”, mimic= “Mplus”)

summary(fit.semmodel1,  fit.measures=TRUE, standardized=TRUE, rsq=TRUE)

fitMeasures(fit.semmodel1)

 

semPaths(fit.semmodel1, whatLabels = “std”, layout=”tree2″, curve = TRUE)

MI<- modindices(fit.semmodel1)

subset(MI, mi > 1.8410)

 

inspect(fit.semmodel1,”cor.lv”)

inspect(fit.semmodel1,”theta”)

 

#Further investigations

#to separate latent correlations

pred<- predict(fit.semmodel1)

head(pred)

pred

#creating dataset with the latent factor values

latentdata<-data.frame(pred)

 

#correlation among latent and observed variables

corrdata<-data.frame(latentdata,portmodel$QPI12,portmodel$PPGDPc12)

cor(corrdata, use =”pairwise.complete.obs”, method=”pearson”)

 

##Univariate Normality Test

#shapiro-wilk test

shapiro.test(latentdata$LPI)

shapiro.test(latentdata$POUT)

shapiro.test(portmodel$QPI12)

shapiro.test(portmodel$PPGDPc12)

 

#Multi-group Structural Equation Model (SEM)

#Measurement Invariance check for Multi-group modelling

library(semTools)

measurementInvariance(semmodel2, data=portmodel, estimator=”MLM”,group=”UNDC”, mimic= “Mplus”)

 

# Multi-group model specification

semmodel2 <- ‘LPI =~ LPIAT12 + LPICQ12 + LPIEA12 + LPIEC12 + LPIFS12+ LPIQT12

POUT =~  CT12 +LC12

 

LPI~c(a,g)*QPI12

POUT~c(b,h)*LPI+c(c,i)*QPI12

PPGDPc12~c(d,j)*POUT+c(e,k)*QPI12+c(f,l)*LPI

 

LPIEC12 ~~  LPIQT12

CT12~~c(a,a)*CT12  #c(a,a) to constrain variances to be equal over groups

#indirect effect

bd:=b*d

hj:=h*j

cd:=c*d

ij:=i*j

abd:=a*b*d

ghj:=g*h*j’

 

#LPIQT12~~c(a,a)*LPIQT12 #to constrain variances to be equal over groups

#LPIQT12~~c(NA,0.001)*LPIQT12 #to fix variance of a particular groups

 

#Multi-group SEM results

fit.semmodel2<- sem(semmodel2, data=portmodel, group=”UNDC”, group.equal = c(“loadings”), estimator=”MLM”)

summary(fit.semmodel2,  fit.measures=TRUE, standardized=TRUE, rsq=TRUE)

inspect(fit.semmodel2,”theta”)

inspect(fit.semmodel2,”cor.lv”)

 

Some comments:

In the fit.semmodel2, you would notice that I have kept the factor loadings equal across groups while comparing regression co-efficients of the developed and developing country groups. The comment presented below, by Linda K. Muthen explains the reason.

 

Click to access the published article of this code!

 

If you found it useful, please share!

 

By Ziaul Haque Munim 

 

Leave your thoughts

The ResearchHUB

A platform to support quality academic research. The goal is to raise awareness among young students and researchers about the need of good research, and guide them to achieve excellence in academia. The platform is run by a team of aspiring PhD students residing in many different countries of the world.

NEWSLETTER