sgmediation2
Sobel-Goodman Tests of Mediation in Stata
sgmediation2 is a user-written Stata command that conducts Sobel-Goodman tests of statistical mediation for linear regression models. sgmediation2 is my update (with permission) to the original command sgmediation, written by Phil Ender of the UCLA Statistical Consulting Group. Questions or requests for additions to the command should be sent to me at tmize@purdue.edu.
The sections below detail:
Installation instructions
Citation
Background on the statistical tests
Illustration of the sgmediation2 command
Bootstrapped standard errors and confidence intervals
Noteworthy new features of sgmediation2
Limitations of the Sobel-Goodman/product of coefficients approach to mediation
References
(1) Installation Instructions
To install in Stata:
net install sgmediation2, from("https://tdmize.github.io/data/sgmediation2")
Once installed, to read the help file (also available here):
help sgmediation2
(2) Citation
Please cite use of the sgmediation2 command as:
Mize, Trenton D. 2022. "sgmediation2: Sobel-Goodman tests of mediation in Stata." https://www.trentonmize.com/software/sgmediation2
(3) Background on the statistical tests
sgmediation2 calculates three different tests of mediation using the "product of coefficients" approach (MacKinnon et al. 2002). Neither this site nor the command sgmediation2 is meant as a blanket endorsement of this approach to mediation. Many approaches to mediation exist, each with pros and cons (e.g., see MacKinnon et al. 2002; Zhao et al. 2010; Keele 2015). Some prominent limitations of this approach, along with a suggested alternative for many cases, are detailed in Section 7 at the bottom of this page.
The commonly used approach to mediation based on Baron and Kenny (1986) suggests that a variable may be considered a mediator to the extent to which it carries the influence of a focal independent variable (IV) to a given dependent variable (DV). In this framework, mediation can be said to occur when (1) the IV significantly affects the mediator, (2) the IV significantly affects the DV in the absence of the mediator, (3) the mediator has a significant unique effect on the DV, and (4) the effect of the IV on the DV shrinks upon the addition of the mediator to the model.
Others (e.g., Preacher and Hayes 2004) suggest that only two requirements need be met: (1) the IV has a significant effect before the mediator is added to the model, and (2) the effect of the IV shrinks upon the addition of the mediator to the model (i.e., the same requirement as #4 above). Simplifying even further, many now suggest (e.g., Zhao, Lynch, and Chen 2010) that the only requirement is that the effect of the IV shrinks upon the addition of the mediator to the model (that is, there is a significant indirect effect; see below for details), because mediation can occur even when the IV has no significant effect before the mediator is added. Yet another approach uses the steps given by Baron and Kenny (1986) but judges mediation by whether the effects in steps #1 and #3 are both significant (e.g., Yzerbyt et al. 2018).
sgmediation2 provides tests of all of the requirements discussed above, to facilitate almost any test desired. I personally agree that the test that the effect of the IV shrinks upon the addition of the mediator to the model (i.e., the indirect effect) is of most central interest. But as Zhao et al. (2010) detail, the individual tests outlined by Baron and Kenny (1986) are still quite useful for determining the specific nature of the mediation found.
Mediation tests
The right half of the figure below illustrates the basic logic of a mediating relationship where the mediating variable (MV) is theoretically at least partially the reason/mechanism by which the focal independent variable (IV) affects the outcome (DV). Because of this mediating relationship, the estimate of the effect of the IV will be smaller after accounting for the mediator (c') than in a model without the mediator (c).
The mechanics of the tests are as follows:
First, regress the DV on the IV along with any control variables. The coefficient on the IV is c, which represents the "total effect" of the IV (i.e., the effect before removing the portion of the effect explained by the mediator).
Second, regress the mediating variable (MV) on the IV and any control variables. The coefficient on the IV is path a.
Third, regress the DV on the IV, the MV, and any control variables. The coefficient on the MV is path b. The coefficient on the IV is c'.
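The three steps above can be sketched by hand with regress. This is only a rough illustration, not the command's internal code. It assumes the example dataset and variable names used in the illustration later on this page (health, edyrs, income, race, woman, age); the scalar names (sg_c, sg_a, sg_b, sg_cprime) are made up for this sketch.

```stata
* Sketch: the three regressions run by hand on one common sample
use "https://tdmize.github.io/data/data/cda_ah4", clear
drop if missing(health, edyrs, income, race, woman, age)

* Step 1: DV on IV and controls -> total effect (c)
regress health edyrs i.race i.woman age
scalar sg_c = _b[edyrs]

* Step 2: MV on IV and controls -> path a
regress income edyrs i.race i.woman age
scalar sg_a = _b[edyrs]

* Step 3: DV on MV, IV, and controls -> path b and direct effect (c')
regress health income edyrs i.race i.woman age
scalar sg_b      = _b[income]
scalar sg_cprime = _b[edyrs]

* Show the four coefficients side by side
display "c  (total effect)  = " %8.4f sg_c
display "a  (IV -> MV)      = " %8.4f sg_a
display "b  (MV -> DV)      = " %8.4f sg_b
display "c' (direct effect) = " %8.4f sg_cprime
```

Dropping observations with missing values first keeps the sample identical across the three models, which matters for comparing c and c'.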
Terminology for effects
Commonly, the effect of the IV before accounting for the mediator (c) is referred to as the total effect. The effect of the IV after accounting for the mediator (c') is referred to as the direct effect. The difference between the total and direct effects is called the indirect effect: the amount of the IV's effect that is explained by the mediator.
Testing the indirect effect
To determine how much of the focal IV's effect is explained by the MV (i.e. the indirect effect), we can calculate either a*b ("product of coefficients") or c - c' ("difference in coefficients") which will be identical in size as long as the same sample is used for all three models described above. All three tests calculated by sgmediation2 use the product of coefficients approach. The tests differ only in their calculation of the standard error of the test of a*b.
The Aroian and Goodman versions of the test differ from the Sobel version in that they include the product of the variance estimates of the coefficients on paths a and b (but in different ways). Results from all three tend to be similar because the product of the variances tends to be small.
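For reference, the three standard errors are usually written as follows. These are the standard forms discussed in the mediation literature (e.g., MacKinnon et al. 2002); I am assuming sgmediation2 follows them, and the help file remains the authority. The symbols s_a and s_b are introduced here for the standard errors of the path a and path b coefficients.

```latex
\begin{align*}
z &= \frac{ab}{SE_{ab}} \\
SE_{\text{Sobel}}   &= \sqrt{b^2 s_a^2 + a^2 s_b^2} \\
SE_{\text{Aroian}}  &= \sqrt{b^2 s_a^2 + a^2 s_b^2 + s_a^2 s_b^2} \\
SE_{\text{Goodman}} &= \sqrt{b^2 s_a^2 + a^2 s_b^2 - s_a^2 s_b^2}
\end{align*}
```

The only difference is the final term: Aroian adds the product of the two variances and Goodman subtracts it, so the Sobel standard error always falls between the other two.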
There is some evidence favoring the Aroian test over the other two (MacKinnon et al. 2002). However, all three have been found to be underpowered, and alternative methods to calculate the standard error have been proposed. In particular, bootstrapping is a popular approach that has been shown to work well even in small samples (Preacher and Hayes 2004). Section 5 below illustrates how to obtain bootstrapped estimates of the standard error and confidence interval on the indirect effect (a*b). See Section 7 for an alternative approach using seemingly-unrelated estimation.
See MacKinnon et al. (2002) for a thorough discussion and comparison of each test.
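As a rough check on the three tests, the z statistics can be computed by hand from the path a and path b regressions. This sketch assumes the standard forms of the Sobel, Aroian, and Goodman standard errors (written out in the code comments) and the example dataset used later on this page; the scalar names are hypothetical. It is not the command's own code, but the results should be close to the "Sobel-Goodman Mediation Tests" table that sgmediation2 reports.

```stata
* Sketch: Sobel, Aroian, and Goodman z statistics computed by hand
use "https://tdmize.github.io/data/data/cda_ah4", clear
drop if missing(health, edyrs, income, race, woman, age)

quietly regress health edyrs i.race i.woman age
scalar sg_c = _b[edyrs]                 // total effect

quietly regress income edyrs i.race i.woman age
scalar sg_a  = _b[edyrs]                // path a
scalar sg_va = _se[edyrs]^2             // variance of a

quietly regress health income edyrs i.race i.woman age
scalar sg_b      = _b[income]           // path b
scalar sg_vb     = _se[income]^2        // variance of b
scalar sg_cprime = _b[edyrs]            // direct effect

* Indirect effect two ways: product vs. difference of coefficients
scalar sg_ab = sg_a*sg_b
display "a*b    = " %9.6f sg_ab
display "c - c' = " %9.6f (sg_c - sg_cprime)

* Sobel omits the variance product; Aroian adds it; Goodman subtracts it
scalar sg_se_sobel   = sqrt(sg_b^2*sg_va + sg_a^2*sg_vb)
scalar sg_se_aroian  = sqrt(sg_b^2*sg_va + sg_a^2*sg_vb + sg_va*sg_vb)
scalar sg_se_goodman = sqrt(sg_b^2*sg_va + sg_a^2*sg_vb - sg_va*sg_vb)

display "Sobel z   = " %6.3f (sg_ab/sg_se_sobel)
display "Aroian z  = " %6.3f (sg_ab/sg_se_aroian)
display "Goodman z = " %6.3f (sg_ab/sg_se_goodman)
```

Note that the Goodman expression under the square root can in principle be negative when both coefficients are very small relative to their standard errors; that is not the case in this example.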
Effect size
a*b (equivalent to c - c') can sometimes be interpreted directly as the amount of the IV's effect the MV explains, or as the "indirect effect" of the IV -> MV -> DV. Alternatively, a*b (or c - c') can be expressed as the proportion reduction in the effect of the IV after accounting for the MV, that is, the indirect effect divided by the total effect: (a*b) / c.
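Written out, the proportion mediated and the two related ratios that appear in the command's output are shown below. The symbol P_M is my shorthand, not the command's, and the worked number uses the unrounded estimates from the bootstrap output later on this page (a*b = 0.01313674, c = 0.09335536).

```latex
\begin{align*}
P_M &= \frac{ab}{c} = \frac{c - c'}{c}
     = \frac{0.01313674}{0.09335536} \approx 0.141 \\
\text{indirect to direct} &= \frac{ab}{c'} \\
\text{total to direct}    &= \frac{c}{c'}
\end{align*}
```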
(4) Illustration of the sgmediation2 command
Those with more education tend to report better health. A possible mediation explanation is that more education leads to higher incomes, which are in turn associated with better health (for lots of reasons). The theoretical causal process is: education -> income -> health.
We also want to control for the respondent's age, gender, and race-ethnicity in this example.
To test this explanation using Sobel-Goodman tests of mediation, first load the example dataset.
use "https://tdmize.github.io/data/data/cda_ah4", clear
By default, sgmediation2 will only use observations that are non-missing on all variables in the models. It is best to handle missing data yourself before using sgmediation2. Multiply imputed data is allowed with sgmediation2 (see Section 6 below). Here, we use listwise deletion:
drop if missing(health, edyrs, income, race, woman, age)
We can now use sgmediation2 to estimate all three necessary models and to compile tables of all of the statistics discussed in Section 3 above. See the sgmediation2 help file for details. The basic syntax of the command is:
sgmediation2 depvar [if exp] [in range] , iv(focal_iv) mv(mediator_var) [options]
Replace depvar, focal_iv, and mediator_var above with the dependent, focal independent, and mediator variables of interest.
For our example, the only additional option we specify is cv( ) which allows us to include a list of control variables.
. sgmediation2 health, iv(edyrs) mv(income) cv(i.race i.woman age)
Model with dv regressed on iv (path c)
regress health edyrs i.race i.woman age, vce()
Source | SS df MS Number of obs = 4,983
-------------+---------------------------------- F(6, 4976) = 56.32
Model | 264.985975 6 44.1643291 Prob > F = 0.0000
Residual | 3902.28234 4,976 .784220727 R-squared = 0.0636
-------------+---------------------------------- Adj R-squared = 0.0625
Total | 4167.26831 4,982 .836464936 Root MSE = .88556
----------------------------------------------------------------------------------
health | Coefficient Std. err. t P>|t| [95% conf. interval]
-----------------+----------------------------------------------------------------
edyrs | 0.093 0.005 16.979 0.000 0.083 0.104
|
race |
Black | -0.111 0.030 -3.747 0.000 -0.169 -0.053
Native American | -0.171 0.145 -1.179 0.238 -0.454 0.113
Asian | -0.201 0.073 -2.735 0.006 -0.345 -0.057
|
woman |
Woman | -0.172 0.025 -6.756 0.000 -0.222 -0.122
age | -0.013 0.007 -1.829 0.068 -0.026 0.001
_cons | 2.817 0.214 13.179 0.000 2.398 3.236
----------------------------------------------------------------------------------
Model with mediator regressed on iv (path a)
regress income edyrs i.race i.woman age, vce()
Source | SS df MS Number of obs = 4,983
-------------+---------------------------------- F(6, 4976) = 171.22
Model | 615297.309 6 102549.551 Prob > F = 0.0000
Residual | 2980328.07 4,976 598.940528 R-squared = 0.1711
-------------+---------------------------------- Adj R-squared = 0.1701
Total | 3595625.38 4,982 721.723279 Root MSE = 24.473
----------------------------------------------------------------------------------
income | Coefficient Std. err. t P>|t| [95% conf. interval]
-----------------+----------------------------------------------------------------
edyrs | 3.836 0.152 25.246 0.000 3.538 4.134
|
race |
Black | -5.922 0.821 -7.215 0.000 -7.531 -4.313
Native American | 0.113 3.997 0.028 0.977 -7.723 7.949
Asian | 4.917 2.030 2.422 0.015 0.937 8.897
|
woman |
Woman | -13.135 0.704 -18.664 0.000 -14.515 -11.755
age | 1.167 0.192 6.086 0.000 0.791 1.543
_cons | -47.033 5.906 -7.963 0.000 -58.612 -35.454
----------------------------------------------------------------------------------
Model with dv regressed on mediator and iv (paths b and c')
regress health income edyrs i.race i.woman age, vce()
Source | SS df MS Number of obs = 4,983
-------------+---------------------------------- F(7, 4975) = 55.12
Model | 299.936161 7 42.848023 Prob > F = 0.0000
Residual | 3867.33215 4,975 .777353196 R-squared = 0.0720
-------------+---------------------------------- Adj R-squared = 0.0707
Total | 4167.26831 4,982 .836464936 Root MSE = .88168
----------------------------------------------------------------------------------
health | Coefficient Std. err. t P>|t| [95% conf. interval]
-----------------+----------------------------------------------------------------
income | 0.003 0.001 6.705 0.000 0.002 0.004
edyrs | 0.080 0.006 13.797 0.000 0.069 0.092
|
race |
Black | -0.091 0.030 -3.061 0.002 -0.149 -0.033
Native American | -0.171 0.144 -1.187 0.235 -0.453 0.111
Asian | -0.218 0.073 -2.975 0.003 -0.361 -0.074
|
woman |
Woman | -0.127 0.026 -4.845 0.000 -0.178 -0.076
age | -0.017 0.007 -2.406 0.016 -0.030 -0.003
_cons | 2.978 0.214 13.906 0.000 2.558 3.398
----------------------------------------------------------------------------------
Sobel-Goodman Mediation Tests
| Est Std_err z P>|z|
---------------------+------------------------------------------------
Sobel | 0.013 0.002 6.481 0.000
Aroian | 0.013 0.002 6.476 0.000
Goodman | 0.013 0.002 6.485 0.000
Indirect, Direct, and Total Effects
| Est Std_err z P>|z|
---------------------+------------------------------------------------
a_coefficient | 3.836 0.152 25.246 0.000
b_coefficient | 0.003 0.001 6.705 0.000
Indirect_effect_aXb | 0.013 0.002 6.481 0.000
Direct_effect_c' | 0.080 0.006 13.797 0.000
Total_effect_c | 0.093 0.005 16.979 0.000
Proportion of total effect that is mediated: 0.141
Ratio of indirect to direct effect: 0.164
Ratio of total to direct effect: 1.164
Although there are many important results shown above, to summarize:
Both the a and b paths are significant at p < 0.001, suggesting that education is associated with higher incomes, and that higher incomes are associated with better health—supporting the proposed mediating relationship.
All three tests of a*b (the indirect effect) shown in the first "Sobel-Goodman mediation tests" table show very small p-values (p < .001) providing support for the explanation that income mediates the effect of education on health.
The effect of education is reduced by about 14% after accounting for income. Theoretically, this suggests that about 14% of the effect of education on health is explained by the indirect path from education through income.
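The summary figures above can also be recovered from the command's stored results. The sketch below is a minimal example that assumes sgmediation2 is installed and uses the three stored results that appear in the bootstrap example on this page, r(ind_eff), r(dir_eff), and r(tot_eff); see the help file for the full list. The scalar names are made up for illustration.

```stata
* Sketch: recompute the summary ratios from sgmediation2's stored results
use "https://tdmize.github.io/data/data/cda_ah4", clear
drop if missing(health, edyrs, income, race, woman, age)

quietly sgmediation2 health, iv(edyrs) mv(income) cv(i.race i.woman age)

* Copy the r() results right away, before another command overwrites them
scalar sg_ind = r(ind_eff)
scalar sg_dir = r(dir_eff)
scalar sg_tot = r(tot_eff)

display "Proportion mediated         = " %5.3f (sg_ind/sg_tot)
display "Ratio of indirect to direct = " %5.3f (sg_ind/sg_dir)
display "Ratio of total to direct    = " %5.3f (sg_tot/sg_dir)
```

These three values should match the last three lines of the command's output shown above.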
(5) Bootstrapped standard error and confidence interval estimates
The default Sobel-Goodman tests shown above are known to have low statistical power. A common recommended solution is to use bootstrapping to obtain the standard errors (and p-values) and/or confidence intervals (e.g. Preacher and Hayes 2004; 2008; Zhao et al. 2010). 1,000 or more bootstrapped samples is a common recommendation (e.g. Preacher and Hayes 2008).
By default, Stata's bootstrap command reports normal-based confidence intervals. Preacher and Hayes (2004; 2008) recommend using percentile CIs because the sampling distribution of a*b tends to be non-normal. Percentile CIs can be obtained with the postestimation command estat bootstrap and the percentile option; adding the bc option also reports bias-corrected CIs.
The example below provides bootstrapped estimates of the indirect, direct, and total effect. Other statistics can be bootstrapped with sgmediation2 (see "stored results" section of the help file).
. bootstrap r(ind_eff) r(dir_eff) r(tot_eff), reps(1000): ///
> sgmediation2 health, iv(edyrs) mv(income) cv(i.race i.woman age)
(running sgmediation2 on estimation sample)
regress health edyrs i.race i.woman age, vce()
regress income edyrs i.race i.woman age, vce()
regress health income edyrs i.race i.woman age, vce()
Bootstrap replications (1,000)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
.................................................. 50
.................................................. 100
.................................................. 150
.................................................. 200
.................................................. 250
.................................................. 300
.................................................. 350
.................................................. 400
.................................................. 450
.................................................. 500
.................................................. 550
.................................................. 600
.................................................. 650
.................................................. 700
.................................................. 750
.................................................. 800
.................................................. 850
.................................................. 900
.................................................. 950
.................................................. 1,000
Bootstrap results Number of obs = 4,983
Replications = 1,000
Command: sgmediation2 health, iv(edyrs) mv(income) cv(i.race i.woman age)
_bs_1: r(ind_eff)
_bs_2: r(dir_eff)
_bs_3: r(tot_eff)
------------------------------------------------------------------------------
| Observed Bootstrap Normal-based
| coefficient std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
_bs_1 | 0.013 0.002 6.604 0.000 0.009 0.017
_bs_2 | 0.080 0.006 13.808 0.000 0.069 0.092
_bs_3 | 0.093 0.006 16.953 0.000 0.083 0.104
------------------------------------------------------------------------------
. estat bootstrap, bc percentile
Bootstrap results Number of obs = 4,983
Replications = 1000
Command: sgmediation2 health, iv(edyrs) mv(income) cv(i.race i.woman age)
_bs_1: r(ind_eff)
_bs_2: r(dir_eff)
_bs_3: r(tot_eff)
------------------------------------------------------------------------------
| Observed Bootstrap
| coefficient Bias std. err. [95% conf. interval]
-------------+----------------------------------------------------------------
_bs_1 | .01313674 .0000643 .00198927 .009313 .0174402 (P)
| .0093504 .0174673 (BC)
_bs_2 | .08021862 .0000374 .00580973 .0679262 .0912915 (P)
| .0676814 .0911606 (BC)
_bs_3 | .09335536 .0001017 .00550685 .0820851 .1035946 (P)
| .0819828 .1033161 (BC)
------------------------------------------------------------------------------
Key: P: Percentile
BC: Bias-corrected
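Because bootstrap results depend on the random draws, it can help to set a seed so the intervals are reproducible. The following is a minimal sketch using bootstrap's seed() option. The seed value and the names indirect, direct, and total are arbitrary choices for this example, and sgmediation2 is assumed to be installed.

```stata
* Sketch: reproducible bootstrap of the indirect, direct, and total effects
use "https://tdmize.github.io/data/data/cda_ah4", clear
drop if missing(health, edyrs, income, race, woman, age)

* Name each statistic so the tables are labeled; seed() fixes the resamples
bootstrap indirect=r(ind_eff) direct=r(dir_eff) total=r(tot_eff), ///
    reps(1000) seed(20220101) nodots: ///
    sgmediation2 health, iv(edyrs) mv(income) cv(i.race i.woman age)

* Percentile confidence intervals
estat bootstrap, percentile
```

Rerunning with the same seed and data should reproduce the same intervals; a different seed will shift them slightly.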
(6) Noteworthy new features of sgmediation2
sgmediation2 includes a few noteworthy features that extend the original sgmediation (see the sgmediation2 help file for all features).
Survey weights and multiply-imputed data
First, sgmediation2 allows the use of survey weights and/or multiply-imputed data. To do so, specify the prefix you would have used on the regress command in the prefix( ) option of sgmediation2. For example, to include survey weights that have already been set with the svyset command, use:
sgmediation2 health, iv(edyrs) mv(income) cv(i.race i.woman age) prefix(svy:)
Additionally, the prefix( ) option can be used to specify mi est, post: for multiple imputation estimates to be used as defined in mi set. Note the post option is needed with multiply imputed data. Or specify mi est, post: svy: for both survey weights and multiple imputation estimates as defined in mi svyset.
Alternative variance estimators
The vce( ) option can be used to obtain a variance estimator other than the default ols. For example, users can specify vce(robust) for robust variance estimates or vce(cluster clustvar) for cluster-robust variance estimates. To adjust the variance estimates for clustering within occupational categories:
sgmediation2 health, iv(edyrs) mv(income) cv(i.race i.woman age) vce(cluster occcat)
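As a further illustration, the same model can be run with robust standard errors on the example dataset. This is a sketch that assumes sgmediation2 is installed. With regress, the coefficient estimates do not depend on the variance estimator, so I would expect a, b, and the indirect effect to match the default run, with only the standard errors and test statistics changing.

```stata
* Sketch: same mediation model with robust (Huber-White) standard errors
use "https://tdmize.github.io/data/data/cda_ah4", clear
drop if missing(health, edyrs, income, race, woman, age)

sgmediation2 health, iv(edyrs) mv(income) cv(i.race i.woman age) vce(robust)
```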
Factor syntax for control variables
As demonstrated in the example above, factor syntax is allowed in the list of control variables. This lets you specify whether each control variable is treated as continuous or nominal.
Factor syntax is not allowed for the focal independent variable (IV) or the mediating variable (MV). This reflects a limitation of the method, not the command (see Section 7 below). IVs and MVs are limited to continuous or binary variables with this method.
(7) Limitations of the Sobel-Goodman/product of coefficients approach to mediation
There are many limitations to this approach to mediation (more than I discuss here). A few of note:
Only continuous or binary focal independent variables (IV) can be examined.
Only continuous or binary mediating variables (MV) can be examined.
Multiple mediating variables (MVs) cannot be easily incorporated.
Limited to tests of a single coefficient. For example, there is no clear way to test whether the effect of age is mediated if both age and age^2 coefficients are included in the models.
Limited to linear regression models.
A specialized approach appropriate only for mediation and not other cross-model comparisons.
These limitations (and some others) were the motivation of my article "A General Framework for Comparing Predictions and Marginal Effects Across Models" (Mize, Doan, and Long 2019). See that article and the associated Stata files if you are interested.
(8) References
Aroian, L. A. (1947). The probability function of the product of two normally distributed variables. Annals of Mathematical Statistics, 18, 265-271.
Baron, R. M., & Kenny, D. A. (1986). The moderator–mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology, 51(6), 1173.
Goodman, L. A. (1960). On the exact variance of products. Journal of the American Statistical Association, 55, 708–713.
Keele, L. (2015). Causal mediation analysis: warning! Assumptions ahead. American Journal of Evaluation, 36(4), 500-513.
MacKinnon, D. P., Lockwood, C. M., Hoffman, J. M., West, S. G., & Sheets, V. (2002). A comparison of methods to test mediation and other intervening variable effects. Psychological Methods, 7(1), 83.
Mize, T. D., Doan, L., & Long, J. S. (2019). A general framework for comparing predictions and marginal effects across models. Sociological Methodology, 49(1), 152-189.
Preacher, K. J., & Hayes, A. F. (2004). SPSS and SAS procedures for estimating indirect effects in simple mediation models. Behavior Research Methods, Instruments, & Computers, 36(4), 717-731.
Preacher, K. J., & Hayes, A. F. (2008). Asymptotic and resampling strategies for assessing and comparing indirect effects in multiple mediator models. Behavior Research Methods, 40(3), 879-891.
Sobel, M. E. (1982). Asymptotic confidence intervals for indirect effects in structural equation models. Sociological Methodology, 13, 290-312.
Zhao, X., Lynch Jr, J. G., & Chen, Q. (2010). Reconsidering Baron and Kenny: Myths and truths about mediation analysis. Journal of Consumer Research, 37(2), 197-206.