sgmediation2

Sobel-Goodman Tests of Mediation in Stata

sgmediation2 is a user-written Stata command which conducts Sobel-Goodman tests of statistical mediation for linear regression models. sgmediation2 is my update (with permission) to the original command sgmediation, written by Phil Ender of the UCLA Statistical Consulting Group. Questions or requests for additions to the command should be sent to me at tmize@purdue.edu

The sections below detail:

Installation instructions
Citation
Background on the statistical tests
Illustration of the sgmediation2 command
Bootstrapped standard errors and confidence intervals
Noteworthy new features of sgmediation2
Limitations of the Sobel-Goodman/product of coefficients approach to mediation
References

(1) Installation Instructions

To install in Stata:

net install sgmediation2, from("https://tdmize.github.io/data") replace

Once installed, to read the help file (also available here):

help sgmediation2

(2) Citation

Please cite use of the sgmediation2 command as:

Mize, Trenton D. 2022. "sgmediation2: Sobel-Goodman tests of mediation in Stata." https://www.trentonmize.com/software/sgmediation2

(3) Background on the statistical tests

sgmediation2 calculates three different tests of mediation using the "product of coefficients" approach (MacKinnon et al. 2002). Neither this site nor the command sgmediation2 is meant as a blanket endorsement of this approach to mediation. Many approaches to mediation exist, each with its own pros and cons (e.g., see MacKinnon et al. 2002; Zhao et al. 2010; Keele 2015). Some prominent limitations of this approach, along with a suggested alternative for many cases (the mecompare command), are detailed in Section 7 near the bottom of this page.

The commonly used approach to mediation based on Baron and Kenny (1986) suggests that a variable may be considered a mediator to the extent to which it carries the influence of a focal independent variable (IV) to a given dependent variable (DV). In this framework, mediation can be said to occur when (1) the IV significantly affects the mediator, (2) the IV significantly affects the DV in the absence of the mediator, (3) the mediator has a significant unique effect on the DV, and (4) the effect of the IV on the DV shrinks upon the addition of the mediator to the model.

Others (e.g., Preacher and Hayes 2004) suggest that only two requirements need be met: (1) the IV has a significant effect before the mediator is added to the model, and (2) the effect of the IV shrinks upon the addition of the mediator to the model (i.e., the same requirement as #4 above). Simplifying even further, many now suggest (e.g., Zhao, Lynch, and Chen 2010) that the only requirement is that the effect of the IV shrinks upon the addition of the mediator to the model (that is, there is a significant indirect effect; see below for details), because mediation can occur even in the absence of a direct effect of the IV. Yet another approach suggests using the steps given by Baron and Kenny (1986) but judging mediation by whether the effects in steps #1 and #3 are both significant (e.g., Yzerbyt et al. 2018).

sgmediation2 provides tests of all of the various requirements discussed above to facilitate almost any test desired. I personally agree that the test of whether the effect of the IV shrinks upon the addition of the mediator to the model (i.e., the indirect effect) is of most central interest. But as Zhao et al. (2010) detail, the individual tests outlined by Baron and Kenny (1986) are still quite useful for determining the specific nature of the mediation found.

Mediation tests

The right half of the figure below illustrates the basic logic of a mediating relationship where the mediating variable (MV) is theoretically at least partially the reason/mechanism by which the focal independent variable (IV) affects the outcome (DV). Because of this mediating relationship, the estimate of the effect of the IV will be smaller after accounting for the mediator (c' ) than in a model without the mediator (c ).

The mechanics of the tests are as follows:

First, regress the DV on the IV along with any control variables. The coefficient on the IV is c, which represents the "total effect" of the IV (i.e., the effect before removing the portion of the effect explained by the mediator).
Second, regress the mediating variable (MV) on the IV and any control variables. The coefficient on the IV is path a.
Third, regress the DV on the IV, the MV, and any control variables. The coefficient on the MV is path b. The coefficient on the IV is c'.
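The three steps above can be sketched by hand with regress. This is only an illustration, not the internals of sgmediation2: it borrows the example dataset and variables used in Section 4 below, and the scalar names (path_c, path_a, path_b, path_cprime) are made up for this sketch.

```stata
* Sketch of the three regressions, using the Section 4 example data.
* Scalar names are illustrative only; they are not sgmediation2 results.
use "https://tdmize.github.io/data/data/cda_ah4", clear
drop if missing(health, edyrs, income, race, woman, age)

* Step 1: DV on IV and controls (total effect, path c)
quietly regress health edyrs i.race i.woman age
scalar path_c = _b[edyrs]

* Step 2: mediator on IV and controls (path a)
quietly regress income edyrs i.race i.woman age
scalar path_a = _b[edyrs]

* Step 3: DV on IV, mediator, and controls (paths b and c')
quietly regress health income edyrs i.race i.woman age
scalar path_b      = _b[income]
scalar path_cprime = _b[edyrs]

* With the same sample in all three models, these two agree
display "a*b    = " path_a*path_b
display "c - c' = " path_c - path_cprime
```

Dropping incomplete cases first is what keeps the sample identical across the three models, which is the condition under which a*b and c - c' coincide.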

Terminology for effects

Commonly, the effect of the IV before accounting for the mediator (c ) is referred to as the total effect. The effect of the IV after accounting for the mediator (c' ) is referred to as the direct effect. The difference of the total and direct effect is called the indirect effect—or the amount of the IV's effect that is explained by the mediator.

Testing the indirect effect

To determine how much of the focal IV's effect is explained by the MV (i.e. the indirect effect), we can calculate either a*b ("product of coefficients") or c - c' ("difference in coefficients") which will be identical in size as long as the same sample is used for all three models described above. All three tests calculated by sgmediation2 use the product of coefficients approach. The tests differ only in their calculation of the standard error of the test of a*b.

The Aroian and Goodman versions of the test differ from the Sobel version in that they include the product of the variance estimates of the coefficients on paths a and b: the Aroian version adds this product to the variance of a*b, while the Goodman version subtracts it. Results from all three tend to be similar because the product of the variances tends to be small.
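Written out, the three standard errors take the following form. The notation is assumed here for illustration: s_a and s_b denote the standard errors of the coefficients on paths a and b. These are the commonly cited versions of the formulas (see MacKinnon et al. 2002), and they reproduce the z statistics shown in the Section 4 example.

```latex
\begin{aligned}
\text{Sobel:}   \quad SE_{ab} &= \sqrt{b^2 s_a^2 + a^2 s_b^2} \\
\text{Aroian:}  \quad SE_{ab} &= \sqrt{b^2 s_a^2 + a^2 s_b^2 + s_a^2 s_b^2} \\
\text{Goodman:} \quad SE_{ab} &= \sqrt{b^2 s_a^2 + a^2 s_b^2 - s_a^2 s_b^2} \\
z &= \frac{ab}{SE_{ab}}
\end{aligned}
```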

There is some evidence favoring the Aroian test over the other two (MacKinnon et al. 2002). However, all three have been found to be underpowered, and alternative methods of calculating the standard error have been proposed. In particular, bootstrapping is a popular approach that has been shown to work well even in small samples (Preacher and Hayes 2004). Section 5 below illustrates how to obtain bootstrapped estimates of the standard error and confidence interval for the indirect effect (a*b). See Section 7 for an alternative approach using seemingly unrelated estimation.

See MacKinnon et al. (2002) for a thorough discussion and comparison of each test.
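As a rough check on how the three tests relate, the z statistics can be computed by hand from the path a and path b regressions. This is a sketch under assumptions: it uses the Section 4 example data, the commonly cited Sobel, Aroian, and Goodman formulas (restated in the comments), and scalar names invented for the illustration.

```stata
* Sketch: Sobel, Aroian, and Goodman z statistics by hand.
* Scalar names (sg_*, z_*) are illustrative, not sgmediation2 results.
use "https://tdmize.github.io/data/data/cda_ah4", clear
drop if missing(health, edyrs, income, race, woman, age)

* Path a: mediator regressed on the IV and controls
quietly regress income edyrs i.race i.woman age
scalar sg_a  = _b[edyrs]
scalar sg_sa = _se[edyrs]

* Path b: DV regressed on the mediator, the IV, and controls
quietly regress health income edyrs i.race i.woman age
scalar sg_b  = _b[income]
scalar sg_sb = _se[income]

* Sobel variance: b^2*s_a^2 + a^2*s_b^2
scalar sg_base = sg_b^2*sg_sa^2 + sg_a^2*sg_sb^2
* Product of the variances: added by Aroian, subtracted by Goodman
scalar sg_prod = sg_sa^2*sg_sb^2

scalar z_sobel   = sg_a*sg_b / sqrt(sg_base)
scalar z_aroian  = sg_a*sg_b / sqrt(sg_base + sg_prod)
scalar z_goodman = sg_a*sg_b / sqrt(sg_base - sg_prod)

display "Sobel z   = " %6.3f z_sobel
display "Aroian z  = " %6.3f z_aroian
display "Goodman z = " %6.3f z_goodman
```

Because the product of the variances is tiny relative to the other two terms, the three z statistics differ only in the second or third decimal place in this example.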

Effect size

a*b (equivalent to c - c') can sometimes be interpreted directly as the amount of the IV's effect that the MV explains, or as the "indirect effect" of the IV -> MV -> DV. Alternatively, a*b (or c - c') can be expressed as the proportional reduction in the effect of the IV after accounting for the MV: a*b / c, or equivalently (c - c') / c. sgmediation2 reports this quantity as the proportion of the total effect that is mediated.

(4) Illustration of the sgmediation2 command

Those with more education tend to report better health. A possible mediation explanation is that more education leads to higher incomes, which are in turn associated with better health (for lots of reasons). The theoretical causal process is: education -> income -> health.

We also want to control for the respondent's age, gender, and race-ethnicity in this example.

To test this explanation using Sobel-Goodman tests of mediation, first load the example dataset.

use "https://tdmize.github.io/data/data/cda_ah4", clear

By default, sgmediation2 will only use observations that are non-missing on all variables in the models. It is best to handle missing data yourself before using sgmediation2. Multiply imputed data are allowed with sgmediation2 (see Section 6 below). Here, we use listwise deletion:

drop if missing(health, edyrs, income, race, woman, age)
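Before dropping cases, it can help to see how much data listwise deletion will cost. The sketch below is one way to do that with built-in Stata commands; it reloads the example data so that it runs on its own, and the scalar name n_incomplete is invented for the illustration.

```stata
* Sketch: inspect missingness before applying listwise deletion
use "https://tdmize.github.io/data/data/cda_ah4", clear

* Missing-value counts for each analysis variable
misstable summarize health edyrs income race woman age

* Number of observations missing on at least one analysis variable
count if missing(health, edyrs, income, race, woman, age)
scalar n_incomplete = r(N)
display "Observations to be dropped: " n_incomplete

* Apply listwise deletion, then confirm the remaining sample size
drop if missing(health, edyrs, income, race, woman, age)
count
```

The final count should match the "Number of obs" reported in the regression output below.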

We can now use sgmediation2 to estimate all three necessary models and to compile tables of all of the statistics discussed in Section 3 above. See the sgmediation2 help file for details. The basic syntax of the command is:

sgmediation2 depvar [if exp] [in range] , iv(focal_iv) mv(mediator_var) [options]

Where depvar, focal_iv, and mediator_var are to be replaced with the dependent, focal independent, and mediator variables of interest.

For our example, the only additional option we specify is cv( ) which allows us to include a list of control variables.

. sgmediation2 health, iv(edyrs) mv(income) cv(i.race i.woman age)

Model with dv regressed on iv (path c)
regress health edyrs i.race i.woman age, vce()

      Source |       SS           df       MS      Number of obs   =     4,983
-------------+----------------------------------   F(6, 4976)      =     56.32
       Model |  264.985975         6  44.1643291   Prob > F        =    0.0000
    Residual |  3902.28234     4,976  .784220727   R-squared       =    0.0636
-------------+----------------------------------   Adj R-squared   =    0.0625
       Total |  4167.26831     4,982  .836464936   Root MSE        =    .88556

----------------------------------------------------------------------------------
          health | Coefficient  Std. err.        t   P>|t|    [95% conf. interval]
-----------------+----------------------------------------------------------------
           edyrs |       0.093      0.005   16.979   0.000       0.083       0.104
                 |
            race |
           Black |      -0.111      0.030   -3.747   0.000      -0.169      -0.053
 Native American |      -0.171      0.145   -1.179   0.238      -0.454       0.113
           Asian |      -0.201      0.073   -2.735   0.006      -0.345      -0.057
                 |
           woman |
           Woman |      -0.172      0.025   -6.756   0.000      -0.222      -0.122
             age |      -0.013      0.007   -1.829   0.068      -0.026       0.001
           _cons |       2.817      0.214   13.179   0.000       2.398       3.236
----------------------------------------------------------------------------------

Model with mediator regressed on iv (path a)
regress income edyrs i.race i.woman age, vce()

      Source |       SS           df       MS      Number of obs   =     4,983
-------------+----------------------------------   F(6, 4976)      =    171.22
       Model |  615297.309         6  102549.551   Prob > F        =    0.0000
    Residual |  2980328.07     4,976  598.940528   R-squared       =    0.1711
-------------+----------------------------------   Adj R-squared   =    0.1701
       Total |  3595625.38     4,982  721.723279   Root MSE        =    24.473

----------------------------------------------------------------------------------
          income | Coefficient  Std. err.        t   P>|t|    [95% conf. interval]
-----------------+----------------------------------------------------------------
           edyrs |       3.836      0.152   25.246   0.000       3.538       4.134
                 |
            race |
           Black |      -5.922      0.821   -7.215   0.000      -7.531      -4.313
 Native American |       0.113      3.997    0.028   0.977      -7.723       7.949
           Asian |       4.917      2.030    2.422   0.015       0.937       8.897
                 |
           woman |
           Woman |     -13.135      0.704  -18.664   0.000     -14.515     -11.755
             age |       1.167      0.192    6.086   0.000       0.791       1.543
           _cons |     -47.033      5.906   -7.963   0.000     -58.612     -35.454
----------------------------------------------------------------------------------

Model with dv regressed on mediator and iv (paths b and c')
regress health income edyrs i.race i.woman age, vce()

      Source |       SS           df       MS      Number of obs   =     4,983
-------------+----------------------------------   F(7, 4975)      =     55.12
       Model |  299.936161         7   42.848023   Prob > F        =    0.0000
    Residual |  3867.33215     4,975  .777353196   R-squared       =    0.0720
-------------+----------------------------------   Adj R-squared   =    0.0707
       Total |  4167.26831     4,982  .836464936   Root MSE        =    .88168

----------------------------------------------------------------------------------
          health | Coefficient  Std. err.        t   P>|t|    [95% conf. interval]
-----------------+----------------------------------------------------------------
          income |       0.003      0.001    6.705   0.000       0.002       0.004
           edyrs |       0.080      0.006   13.797   0.000       0.069       0.092
                 |
            race |
           Black |      -0.091      0.030   -3.061   0.002      -0.149      -0.033
 Native American |      -0.171      0.144   -1.187   0.235      -0.453       0.111
           Asian |      -0.218      0.073   -2.975   0.003      -0.361      -0.074
                 |
           woman |
           Woman |      -0.127      0.026   -4.845   0.000      -0.178      -0.076
             age |      -0.017      0.007   -2.406   0.016      -0.030      -0.003
           _cons |       2.978      0.214   13.906   0.000       2.558       3.398
----------------------------------------------------------------------------------

Sobel-Goodman Mediation Tests

                     |       Est    Std_err          z      P>|z|
---------------------+------------------------------------------------
               Sobel |     0.013      0.002      6.481      0.000
              Aroian |     0.013      0.002      6.476      0.000
             Goodman |     0.013      0.002      6.485      0.000

Indirect, Direct, and Total Effects

                     |       Est    Std_err          z      P>|z|
---------------------+------------------------------------------------
       a_coefficient |     3.836      0.152     25.246      0.000
       b_coefficient |     0.003      0.001      6.705      0.000
 Indirect_effect_aXb |     0.013      0.002      6.481      0.000
    Direct_effect_c' |     0.080      0.006     13.797      0.000
      Total_effect_c |     0.093      0.005     16.979      0.000

Proportion of total effect that is mediated: 0.141
Ratio of indirect to direct effect: 0.164
Ratio of total to direct effect: 1.164

Although there are many important results shown above, to summarize:

Both the a and b paths are significant at p < 0.001, suggesting that education is associated with higher incomes, and that higher incomes are associated with better health—supporting the proposed mediating relationship.
All three tests of a*b (the indirect effect) in the first table, "Sobel-Goodman Mediation Tests", have very small p-values (p < .001), providing support for the explanation that income mediates the effect of education on health.
The effect of education is reduced by about 14% after accounting for income. Theoretically, this suggests that about 14% of the effect of education on health is explained by the indirect path from education through income.
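The three summary ratios at the bottom of the output can be recovered from the stored results. The sketch below assumes only the stored-result names that appear in the bootstrap example in Section 5 (r(ind_eff), r(dir_eff), r(tot_eff)); the scalar names are invented for the illustration, and sgmediation2 must already be installed.

```stata
* Sketch: recompute the summary ratios from sgmediation2's stored results
use "https://tdmize.github.io/data/data/cda_ah4", clear
drop if missing(health, edyrs, income, race, woman, age)

sgmediation2 health, iv(edyrs) mv(income) cv(i.race i.woman age)

* Copy the stored results right away, before another command replaces r()
scalar eff_ind = r(ind_eff)
scalar eff_dir = r(dir_eff)
scalar eff_tot = r(tot_eff)

* Illustrative scalar names for the three ratios
scalar prop_med      = eff_ind / eff_tot
scalar ratio_ind_dir = eff_ind / eff_dir
scalar ratio_tot_dir = eff_tot / eff_dir

display "Proportion of total effect mediated: " %5.3f prop_med
display "Ratio of indirect to direct effect:  " %5.3f ratio_ind_dir
display "Ratio of total to direct effect:     " %5.3f ratio_tot_dir
```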

(5) Bootstrapped standard error and confidence interval estimates

The default Sobel-Goodman tests shown above are known to have low statistical power. A common recommended solution is to use bootstrapping to obtain the standard errors (and p-values) and/or confidence intervals (e.g. Preacher and Hayes 2004; 2008; Zhao et al. 2010). 1,000 or more bootstrapped samples is a common recommendation (e.g. Preacher and Hayes 2008).

By default, Stata's bootstrap command reports normal-based confidence intervals, and the postestimation command estat bootstrap reports bias-corrected intervals. Preacher and Hayes (2004; 2008) recommend percentile CIs because the sampling distribution of a*b tends to be non-normal. Percentile CIs can be obtained by adding the percentile option to estat bootstrap; the example below requests both bias-corrected (bc) and percentile intervals.

The example below provides bootstrapped estimates of the indirect, direct, and total effect. Other statistics can be bootstrapped with sgmediation2 (see "stored results" section of the help file).

. bootstrap r(ind_eff) r(dir_eff) r(tot_eff), reps(1000): ///
> sgmediation2 health, iv(edyrs) mv(income) cv(i.race i.woman age)
(running sgmediation2 on estimation sample)
regress health edyrs i.race i.woman age, vce()
regress income edyrs i.race i.woman age, vce()
regress health income edyrs i.race i.woman age, vce()

Bootstrap replications (1,000)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..................................................   100
..................................................   150
..................................................   200
..................................................   250
..................................................   300
..................................................   350
..................................................   400
..................................................   450
..................................................   500
..................................................   550
..................................................   600
..................................................   650
..................................................   700
..................................................   750
..................................................   800
..................................................   850
..................................................   900
..................................................   950
.................................................. 1,000

Bootstrap results                                   Number of obs =      4,983
                                                    Replications  =      1,000

      Command: sgmediation2 health, iv(edyrs) mv(income) cv(i.race i.woman age)
        _bs_1: r(ind_eff)
        _bs_2: r(dir_eff)
        _bs_3: r(tot_eff)

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             | coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       _bs_1 |      0.013      0.002    6.604   0.000        0.009       0.017
       _bs_2 |      0.080      0.006   13.808   0.000        0.069       0.092
       _bs_3 |      0.093      0.006   16.953   0.000        0.083       0.104
------------------------------------------------------------------------------

. estat bootstrap, bc percentile

Bootstrap results                                   Number of obs =      4,983
                                                    Replications  =       1000

      Command: sgmediation2 health, iv(edyrs) mv(income) cv(i.race i.woman age)
        _bs_1: r(ind_eff)
        _bs_2: r(dir_eff)
        _bs_3: r(tot_eff)

------------------------------------------------------------------------------
             |    Observed               Bootstrap
             | coefficient       Bias    std. err.  [95% conf. interval]
-------------+----------------------------------------------------------------
       _bs_1 |   .01313674   .0000643   .00198927     .009313   .0174402   (P)
             |                                       .0093504   .0174673  (BC)
       _bs_2 |   .08021862   .0000374   .00580973    .0679262   .0912915   (P)
             |                                       .0676814   .0911606  (BC)
       _bs_3 |   .09335536   .0001017   .00550685    .0820851   .1035946   (P)
             |                                       .0819828   .1033161  (BC)
------------------------------------------------------------------------------
Key:  P: Percentile
     BC: Bias-corrected
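Bootstrap results change from run to run unless the random-number seed is fixed. The sketch below is one way to make the example above reproducible, using the seed( ) and nodots options of Stata's bootstrap prefix; the seed value is arbitrary, so results will not match the output above digit for digit.

```stata
* Sketch: reproducible bootstrap of the indirect, direct, and total effects
use "https://tdmize.github.io/data/data/cda_ah4", clear
drop if missing(health, edyrs, income, race, woman, age)

* seed() fixes the resampling; nodots suppresses the replication dots.
* The seed value 12345 is arbitrary.
bootstrap r(ind_eff) r(dir_eff) r(tot_eff), ///
    reps(1000) seed(12345) nodots: ///
    sgmediation2 health, iv(edyrs) mv(income) cv(i.race i.woman age)

* Percentile confidence intervals only
estat bootstrap, percentile
```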

(6) Noteworthy new features of sgmediation2

sgmediation2 includes a few noteworthy features that extend the original sgmediation (see the sgmediation2 help file for all features).

Survey weights and multiply-imputed data

First, sgmediation2 allows the use of survey weights and/or multiply-imputed data. To do so, specify the prefix you would have used on the regress command in the prefix( ) option of sgmediation2. For example, to include survey weights that have already been set with the svyset command, use:

sgmediation2 health, iv(edyrs) mv(income) cv(i.race i.woman age) prefix(svy:)

Additionally, the prefix( ) option can be used to specify mi est, post: so that multiple imputation estimates are used as defined in mi set. Note that the post option is required with multiply imputed data. To use both survey weights and multiple imputation estimates as defined in mi svyset, specify mi est, post: svy:
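Put together, the two multiple-imputation prefixes described above would look roughly like this. This is a sketch only: it assumes the data have already been mi set and imputed (and, for the second command, declared with mi svyset), which is not true of the example dataset as loaded in Section 4.

```stata
* Sketch: assumes the data are already mi set and contain imputations
sgmediation2 health, iv(edyrs) mv(income) cv(i.race i.woman age) ///
    prefix(mi est, post:)

* Sketch: assumes mi svyset has also been run on the imputed data
sgmediation2 health, iv(edyrs) mv(income) cv(i.race i.woman age) ///
    prefix(mi est, post: svy:)
```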

Alternative variance estimators

The vce( ) option can be used to obtain a variance estimator other than the default ols. For example, users can specify vce(robust) for robust variance estimates or vce(cluster clustvar) for cluster-robust variance estimates. To adjust the variance estimates for clustering within occupational categories:

sgmediation2 health, iv(edyrs) mv(income) cv(i.race i.woman age) vce(cluster occcat)

Factor syntax for control variables

As demonstrated in the example above, factor syntax is allowed in the list of control variables. This allows you to specify whether each control variable is treated as continuous or nominal.

Factor syntax is not allowed for the focal independent variable (IV) or the mediating variable (MV) reflecting a limitation of the methods—not the command (see Section 7 below). IVs and MVs are limited to continuous or binary variables with this method.

(7) Limitations of the Sobel-Goodman/product of coefficients approach to mediation

There are many limitations to this approach to mediation (more than I discuss here). A few of note:

Only continuous or binary focal independent variables (IV) can be examined.
Only continuous or binary mediating variables (MV) can be examined.
Multiple mediating variables (MVs) cannot be easily incorporated.
Limited to tests of a single coefficient. E.g. There is no clear way to test if the effect of age is mediated if both age and age^2 coefficients are included in the models.
Limited to linear regression models.
A specialized approach appropriate only for mediation and not other cross-model comparisons.

These limitations (and some others) were the motivation for my article "A General Framework for Comparing Predictions and Marginal Effects Across Models" (Mize, Doan, and Long 2019).

The mecompare command implements the Mize, Doan, and Long (2019) approach to cross-model comparisons.

(8) References

Aroian, L. A. (1947). The probability function of the product of two normally distributed variables. Annals of Mathematical Statistics, 18, 265-271.

Baron, R. M., & Kenny, D. A. (1986). The moderator–mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology, 51(6), 1173-1182.

Goodman, L. A. (1960). On the exact variance of products. Journal of the American Statistical Association, 55, 708-713.

Keele, L. (2015). Causal mediation analysis: Warning! Assumptions ahead. American Journal of Evaluation, 36(4), 500-513.

MacKinnon, D. P., Lockwood, C. M., Hoffman, J. M., West, S. G., & Sheets, V. (2002). A comparison of methods to test mediation and other intervening variable effects. Psychological Methods, 7(1), 83-104.

Mize, T. D., Doan, L., & Long, J. S. (2019). A general framework for comparing predictions and marginal effects across models. Sociological Methodology, 49(1), 152-189.

Preacher, K. J., & Hayes, A. F. (2004). SPSS and SAS procedures for estimating indirect effects in simple mediation models. Behavior Research Methods, Instruments, & Computers, 36(4), 717-731.

Preacher, K. J., & Hayes, A. F. (2008). Asymptotic and resampling strategies for assessing and comparing indirect effects in multiple mediator models. Behavior Research Methods, 40(3), 879-891.

Sobel, M. E. (1982). Asymptotic confidence intervals for indirect effects in structural equation models. Sociological Methodology, 13, 290-312.

Zhao, X., Lynch Jr, J. G., & Chen, Q. (2010). Reconsidering Baron and Kenny: Myths and truths about mediation analysis. Journal of Consumer Research, 37(2), 197-206.