sgmediation2

 Sobel-Goodman Tests of Mediation in Stata

sgmediation2 is a user-written Stata command that conducts Sobel-Goodman tests of statistical mediation for linear regression models. sgmediation2 is my update (with permission) to the original command sgmediation, written by Phil Ender of the UCLA Statistical Consulting Group. Questions or requests for additions to the command should be sent to me at tmize@purdue.edu.

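The sections below detail installation, citation, background on the statistical tests, an illustration of the command, bootstrapped estimates, noteworthy new features, limitations of the approach, and references.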

(1) Installation Instructions 

To install in Stata:

    net install sgmediation2, from("https://tdmize.github.io/data/sgmediation2")

Once installed, to read the help file (also available here):

    help sgmediation2

(2) Citation

Please cite use of the sgmediation2 command as:

Mize, Trenton D. 2022. "sgmediation2: Sobel-Goodman tests of mediation in Stata." https://www.trentonmize.com/software/sgmediation2

(3) Background on the statistical tests

sgmediation2 calculates three different tests of mediation using the "product of coefficients" approach (MacKinnon et al. 2002). Neither this site nor the command sgmediation2 is meant as a blanket endorsement of this approach to mediation. Many approaches to mediation exist, each with its own pros and cons (e.g. see MacKinnon et al. 2002; Zhao et al. 2010; Keele 2015). Some prominent limitations of this approach, along with a suggested alternative for many cases, are detailed in Section 7 at the bottom of this page.

The commonly used approach to mediation based on Baron and Kenny (1986) suggests that a variable may be considered a mediator to the extent to which it carries the influence of a focal independent variable (IV) to a given dependent variable (DV). In this framework, mediation can be said to occur when (1) the IV significantly affects the mediator, (2) the IV significantly affects the DV in the absence of the mediator, (3) the mediator has a significant unique effect on the DV, and (4) the effect of the IV on the DV shrinks upon the addition of the mediator to the model. 

Others (e.g., Preacher and Hayes 2004) suggest that only two requirements need be met: (1) the IV has a significant effect before the mediator is added to the model, and (2) the effect of the IV shrinks upon the addition of the mediator to the model (i.e. same requirement as #4 above). Simplifying even further, many now suggest (e.g., Zhao, Lynch, and Chen 2010) that the only needed requirement is that the effect of the IV shrinks upon the addition of the mediator to the model (AKA there is a significant indirect effect; see below for details) because mediation can occur even in the absence of a direct effect of the IV. Yet another approach suggests using the steps given by Baron and Kenny (1986) but determining mediation as whether or not the effects in steps #1 and #3 are both significant (e.g., Yzerbyt et al. 2018).

sgmediation2 provides tests of all of the requirements discussed above to facilitate almost any test desired. I personally agree that the test that the effect of the IV shrinks upon the addition of the mediator to the model (i.e. the indirect effect) is of most central interest. But as Zhao et al. (2010) detail, the individual tests outlined by Baron and Kenny (1986) are still quite useful for determining the specific nature of the mediation found.

Mediation tests

The right half of the figure below illustrates the basic logic of a mediating relationship where the mediating variable (MV) is theoretically at least partially the reason/mechanism by which the focal independent variable (IV) affects the outcome (DV). Because of this mediating relationship, the estimate of the effect of the IV will be smaller after accounting for the mediator (c') than in a model without the mediator (c).

The mechanics of the tests are as follows:
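One common way to write the three linear models behind these paths is sketched below. The intercepts i_1 to i_3 and error terms e_1 to e_3 are notation assumed here for illustration, and control variables are omitted for brevity. The path labels a, b, c, and c' match those used in the command's output.

```latex
\begin{align}
\text{DV} &= i_1 + c\,\text{IV} + e_1 \\
\text{MV} &= i_2 + a\,\text{IV} + e_2 \\
\text{DV} &= i_3 + c'\,\text{IV} + b\,\text{MV} + e_3
\end{align}
```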

Terminology for effects

Commonly, the effect of the IV before accounting for the mediator (c) is referred to as the total effect. The effect of the IV after accounting for the mediator (c') is referred to as the direct effect. The difference between the total and direct effects is called the indirect effect, or the amount of the IV's effect that is explained by the mediator.

Testing the indirect effect

To determine how much of the focal IV's effect is explained by the MV (i.e. the indirect effect), we can calculate either a*b ("product of coefficients") or c - c' ("difference in coefficients") which will be identical in size as long as the same sample is used for all three models described above. All three tests calculated by sgmediation2 use the product of coefficients approach. The tests differ only in their calculation of the standard error of the test of a*b.

The Aroian and Goodman versions of the test differ from the Sobel version in that they include the product of the variance estimates of the coefficients on paths a and b (but in different ways). Results from all three tend to be similar because the product of the variances tends to be small.
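The three standard errors are usually written as follows, where s_a and s_b denote the standard errors of the coefficients on paths a and b. This is the standard textbook form (see MacKinnon et al. 2002), offered as a sketch rather than a transcription of the command's internals.

```latex
\begin{align}
SE_{\text{Sobel}}   &= \sqrt{b^2 s_a^2 + a^2 s_b^2} \\
SE_{\text{Aroian}}  &= \sqrt{b^2 s_a^2 + a^2 s_b^2 + s_a^2 s_b^2} \\
SE_{\text{Goodman}} &= \sqrt{b^2 s_a^2 + a^2 s_b^2 - s_a^2 s_b^2}
\end{align}
```

In each case the test statistic is z = a*b / SE, so the tests differ only through the small s_a^2 s_b^2 term.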

There is some evidence favoring the Aroian test over the other two (MacKinnon et al. 2002). However, all three have been found to be underpowered and alternative methods to calculate the standard error have been proposed. In particular, bootstrapping is a popular approach that has been shown to work well even in small samples (Preacher and Hayes 2004). Section 5 below illustrates how to obtain bootstrapped estimates of the standard error and confidence interval on the indirect effect (a*b). See Section 7 for an alternative approach using seemingly-unrelated estimation.

See MacKinnon et al. (2002) for a thorough discussion and comparison of each test.

Effect size

a*b (equivalent to c - c') can sometimes be interpreted directly as the amount of the IV's effect the MV explains or as the "indirect effect" of the IV -> MV -> DV. Alternatively, a*b (or c - c') can be expressed as the proportion reduction in the effect of the IV after accounting for the MV: (c - c') / c, or equivalently a*b / c.

(4) Illustration of the sgmediation2 command

Those with more education tend to report better health. A possible mediation explanation is that more education leads to higher incomes, which are in turn associated with better health (for lots of reasons). The theoretical causal process is: education -> income -> health.

We also want to control for the respondent's age, gender, and race-ethnicity in this example.

To test this explanation using Sobel-Goodman tests of mediation, first load the example dataset.

use "https://tdmize.github.io/data/data/cda_ah4", clear

By default, sgmediation2 will only use observations that are non-missing on all variables in the models. It is best to handle missing data yourself before using sgmediation2. Multiply imputed data is allowed with sgmediation2 (see Section 6 below). Here, we use listwise deletion:

drop if missing(health, edyrs, income, race, woman, age)
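Before dropping cases, it can help to see how much data listwise deletion would remove. The sketch below uses Stata's built-in misstable and egen commands on the freshly loaded data. The variable name nmiss is introduced here purely for illustration.

```stata
* Reload the example data so the check runs before any cases are dropped
use "https://tdmize.github.io/data/data/cda_ah4", clear

* Tabulate missing values for the analysis variables
misstable summarize health edyrs income race woman age

* Count the complete cases that listwise deletion would keep
* (nmiss is a hypothetical helper variable)
egen nmiss = rowmiss(health edyrs income race woman age)
count if nmiss == 0
```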

We can now use sgmediation2 to estimate all three necessary models and to compile tables of all of the statistics discussed in Section 3 above. See the sgmediation2 help file for details. The basic syntax of the command is:

sgmediation2 depvar [if exp] [in range] , iv(focal_iv) mv(mediator_var) [options]

Here depvar, focal_iv, and mediator_var are to be replaced with the dependent, focal independent, and mediator variables of interest.
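As a hedged illustration of the [if exp] qualifier in the syntax above, the call below restricts all three models to a subsample. The age cutoff of 28 is an arbitrary value chosen only for illustration; the variable names come from the example dataset used on this page.

```stata
* Estimate the mediation models only for respondents aged 28 or older
* (the cutoff is arbitrary and purely illustrative)
sgmediation2 health if age >= 28, iv(edyrs) mv(income) cv(i.race i.woman age)
```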

For our example, the only additional option we specify is cv( ) which allows us to include a list of control variables.

. sgmediation2 health, iv(edyrs) mv(income) cv(i.race i.woman age)


Model with dv regressed on iv (path c)

 regress health edyrs i.race i.woman age, vce()

      Source |       SS           df       MS      Number of obs   =     4,983
-------------+----------------------------------   F(6, 4976)      =     56.32
       Model |  264.985975         6  44.1643291   Prob > F        =    0.0000
    Residual |  3902.28234     4,976  .784220727   R-squared       =    0.0636
-------------+----------------------------------   Adj R-squared   =    0.0625
       Total |  4167.26831     4,982  .836464936   Root MSE        =    .88556

----------------------------------------------------------------------------------
          health | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-----------------+----------------------------------------------------------------
           edyrs |      0.093      0.005   16.979   0.000        0.083       0.104
                 |
            race |
          Black  |     -0.111      0.030   -3.747   0.000       -0.169      -0.053
Native American  |     -0.171      0.145   -1.179   0.238       -0.454       0.113
          Asian  |     -0.201      0.073   -2.735   0.006       -0.345      -0.057
                 |
           woman |
          Woman  |     -0.172      0.025   -6.756   0.000       -0.222      -0.122
             age |     -0.013      0.007   -1.829   0.068       -0.026       0.001
           _cons |      2.817      0.214   13.179   0.000        2.398       3.236
----------------------------------------------------------------------------------


Model with mediator regressed on iv (path a)

 regress income edyrs i.race i.woman age, vce()

      Source |       SS           df       MS      Number of obs   =     4,983
-------------+----------------------------------   F(6, 4976)      =    171.22
       Model |  615297.309         6  102549.551   Prob > F        =    0.0000
    Residual |  2980328.07     4,976  598.940528   R-squared       =    0.1711
-------------+----------------------------------   Adj R-squared   =    0.1701
       Total |  3595625.38     4,982  721.723279   Root MSE        =    24.473

----------------------------------------------------------------------------------
          income | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-----------------+----------------------------------------------------------------
           edyrs |      3.836      0.152   25.246   0.000        3.538       4.134
                 |
            race |
          Black  |     -5.922      0.821   -7.215   0.000       -7.531      -4.313
Native American  |      0.113      3.997    0.028   0.977       -7.723       7.949
          Asian  |      4.917      2.030    2.422   0.015        0.937       8.897
                 |
           woman |
          Woman  |    -13.135      0.704  -18.664   0.000      -14.515     -11.755
             age |      1.167      0.192    6.086   0.000        0.791       1.543
           _cons |    -47.033      5.906   -7.963   0.000      -58.612     -35.454
----------------------------------------------------------------------------------


Model with dv regressed on mediator and iv (paths b and c')

 regress health income edyrs i.race i.woman age, vce()

      Source |       SS           df       MS      Number of obs   =     4,983
-------------+----------------------------------   F(7, 4975)      =     55.12
       Model |  299.936161         7   42.848023   Prob > F        =    0.0000
    Residual |  3867.33215     4,975  .777353196   R-squared       =    0.0720
-------------+----------------------------------   Adj R-squared   =    0.0707
       Total |  4167.26831     4,982  .836464936   Root MSE        =    .88168

----------------------------------------------------------------------------------
          health | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-----------------+----------------------------------------------------------------
          income |      0.003      0.001    6.705   0.000        0.002       0.004
           edyrs |      0.080      0.006   13.797   0.000        0.069       0.092
                 |
            race |
          Black  |     -0.091      0.030   -3.061   0.002       -0.149      -0.033
Native American  |     -0.171      0.144   -1.187   0.235       -0.453       0.111
          Asian  |     -0.218      0.073   -2.975   0.003       -0.361      -0.074
                 |
           woman |
          Woman  |     -0.127      0.026   -4.845   0.000       -0.178      -0.076
             age |     -0.017      0.007   -2.406   0.016       -0.030      -0.003
           _cons |      2.978      0.214   13.906   0.000        2.558       3.398
----------------------------------------------------------------------------------


Sobel-Goodman Mediation Tests

                     |        Est     Std_err           z       P>|z|
---------------------+------------------------------------------------
               Sobel |      0.013       0.002       6.481       0.000
              Aroian |      0.013       0.002       6.476       0.000
             Goodman |      0.013       0.002       6.485       0.000


Indirect, Direct, and Total Effects

                     |        Est     Std_err           z       P>|z|
---------------------+------------------------------------------------
       a_coefficient |      3.836       0.152      25.246       0.000
       b_coefficient |      0.003       0.001       6.705       0.000
 Indirect_effect_aXb |      0.013       0.002       6.481       0.000
    Direct_effect_c' |      0.080       0.006      13.797       0.000
      Total_effect_c |      0.093       0.005      16.979       0.000

Proportion of total effect that is mediated:       0.141
Ratio of indirect to direct effect:                0.164
Ratio of total to direct effect:                   1.164
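To make the link between the three regressions and the summary statistics concrete, the sketch below recomputes the indirect effect, the Sobel z statistic, and the proportion mediated by hand using plain regress. It is an illustration of the arithmetic under the standard Sobel formula, not the command's internal code, and the scalar names (c_tot, a_hat, and so on) are made up for this example. The displayed values should agree with the sgmediation2 tables above.

```stata
* Load and prepare the example data as in the text
use "https://tdmize.github.io/data/data/cda_ah4", clear
drop if missing(health, edyrs, income, race, woman, age)

* Path c: total effect of the IV
regress health edyrs i.race i.woman age
scalar c_tot = _b[edyrs]

* Path a: effect of the IV on the mediator
regress income edyrs i.race i.woman age
scalar a_hat = _b[edyrs]
scalar a_se  = _se[edyrs]

* Paths b and c': mediator and IV in the same model
regress health income edyrs i.race i.woman age
scalar b_hat = _b[income]
scalar b_se  = _se[income]
scalar c_dir = _b[edyrs]

* Indirect effect, Sobel standard error, and proportion mediated
scalar ind   = scalar(a_hat) * scalar(b_hat)
scalar sobel = sqrt(scalar(b_hat)^2 * scalar(a_se)^2 + scalar(a_hat)^2 * scalar(b_se)^2)

display "a*b (product of coefficients):    " %6.3f (scalar(ind))
display "c - c' (difference):              " %6.3f (scalar(c_tot) - scalar(c_dir))
display "Sobel z:                          " %6.3f (scalar(ind) / scalar(sobel))
display "Proportion of total mediated:     " %6.3f (scalar(ind) / scalar(c_tot))
```

The scalar() wrapper is used so that the names cannot be confused with variables in the dataset.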



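Although there are many important results shown above, to summarize: the indirect effect of education on health through income (a*b = 0.013) is statistically significant in all three tests (z of about 6.48, p < 0.001), and income accounts for about 14% of the total effect of education.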

(5) Bootstrapped standard error and confidence interval estimates

The default Sobel-Goodman tests shown above are known to have low statistical power. A commonly recommended solution is to use bootstrapping to obtain the standard errors (and p-values) and/or confidence intervals (e.g. Preacher and Hayes 2004; 2008; Zhao et al. 2010). Using 1,000 or more bootstrapped samples is a common recommendation (e.g. Preacher and Hayes 2008).

By default, Stata's bootstrap command reports normal-based confidence intervals, and the postestimation command estat bootstrap reports bias-corrected ones. Preacher and Hayes (2004; 2008) recommend using percentile CIs because the sampling distribution of a*b tends to be non-normal. Percentile CIs can be obtained with estat bootstrap and the percentile option.

The example below provides bootstrapped estimates of the indirect, direct, and total effect. Other statistics can be bootstrapped with sgmediation2 (see "stored results" section of the help file).

. bootstrap r(ind_eff) r(dir_eff) r(tot_eff), reps(1000): ///
>         sgmediation2 health, iv(edyrs) mv(income) cv(i.race i.woman age)

(running sgmediation2 on estimation sample)
 regress health edyrs i.race i.woman age, vce()
 regress income edyrs i.race i.woman age, vce()
 regress health income edyrs i.race i.woman age, vce()

Bootstrap replications (1,000)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..................................................   100
..................................................   150
..................................................   200
..................................................   250
..................................................   300
..................................................   350
..................................................   400
..................................................   450
..................................................   500
..................................................   550
..................................................   600
..................................................   650
..................................................   700
..................................................   750
..................................................   800
..................................................   850
..................................................   900
..................................................   950
.................................................. 1,000

Bootstrap results                                        Number of obs = 4,983
                                                         Replications  = 1,000

      Command: sgmediation2 health, iv(edyrs) mv(income) cv(i.race i.woman age)
        _bs_1: r(ind_eff)
        _bs_2: r(dir_eff)
        _bs_3: r(tot_eff)

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             | coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       _bs_1 |      0.013      0.002    6.604   0.000        0.009       0.017
       _bs_2 |      0.080      0.006   13.808   0.000        0.069       0.092
       _bs_3 |      0.093      0.006   16.953   0.000        0.083       0.104
------------------------------------------------------------------------------




. estat bootstrap, bc percentile

Bootstrap results                               Number of obs     =      4,983
                                                Replications      =       1000

      Command: sgmediation2 health, iv(edyrs) mv(income) cv(i.race i.woman age)
        _bs_1: r(ind_eff)
        _bs_2: r(dir_eff)
        _bs_3: r(tot_eff)

------------------------------------------------------------------------------
             |    Observed               Bootstrap
             | coefficient       Bias    std. err.  [95% conf. interval]
-------------+----------------------------------------------------------------
       _bs_1 |   .01313674   .0000643   .00198927     .009313   .0174402   (P)
             |                                       .0093504   .0174673  (BC)
       _bs_2 |   .08021862   .0000374   .00580973    .0679262   .0912915   (P)
             |                                       .0676814   .0911606  (BC)
       _bs_3 |   .09335536   .0001017   .00550685    .0820851   .1035946   (P)
             |                                       .0819828   .1033161  (BC)
------------------------------------------------------------------------------
Key:  P: Percentile
     BC: Bias-corrected
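Because bootstrap resamples at random, the estimates above will vary slightly from run to run. A minimal sketch of a reproducible version follows, using the seed() option of Stata's bootstrap prefix. The seed value 12345 is arbitrary and chosen only for illustration.

```stata
* seed() fixes the random-number state so the replications can be reproduced
* (12345 is an arbitrary illustrative value)
bootstrap r(ind_eff), reps(1000) seed(12345): ///
    sgmediation2 health, iv(edyrs) mv(income) cv(i.race i.woman age)

* Report percentile confidence intervals for the indirect effect
estat bootstrap, percentile
```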

(6) Noteworthy new features of sgmediation2

sgmediation2 includes a few noteworthy features that extend the original sgmediation (see the sgmediation2 help file for all features). 

Survey weights and multiply-imputed data

First, sgmediation2 allows the use of survey weights and/or multiply-imputed data. To do so, specify the prefix you would have used on the regress command in the prefix( ) option of sgmediation2. For example, to use survey weights that have already been set with the svyset command:

sgmediation2 health, iv(edyrs) mv(income) cv(i.race i.woman age) prefix(svy:)
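For completeness, a sketch of the full sequence is shown below, including the svyset step that the command above assumes has already been run. The design variables psu_id, samp_wt, and stratum_id are hypothetical names; substitute the ones from your own survey.

```stata
* Declare the survey design first
* (psu_id, samp_wt, and stratum_id are hypothetical variable names)
svyset psu_id [pweight = samp_wt], strata(stratum_id)

* Then pass the svy: prefix through to the underlying regress commands
sgmediation2 health, iv(edyrs) mv(income) cv(i.race i.woman age) prefix(svy:)
```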

Additionally, the prefix( ) option can be used to specify mi est, post: for multiple imputation estimates to be used as defined in mi set. Note the post option is needed with multiply imputed data. Or specify mi est, post: svy: for both survey weights and multiple imputation estimates as defined in mi svyset.
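Written out as commands, the two prefixes described above would look like the sketch below. It assumes the data have already been mi set and imputed (and, for the second call, that mi svyset has been run); those preparation steps are not shown.

```stata
* Multiple imputation estimates (assumes the data are already mi set and imputed)
sgmediation2 health, iv(edyrs) mv(income) cv(i.race i.woman age) prefix(mi est, post:)

* Multiple imputation plus survey weights (assumes mi svyset has been run)
sgmediation2 health, iv(edyrs) mv(income) cv(i.race i.woman age) prefix(mi est, post: svy:)
```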

Alternative variance estimators

The vce( ) option can be used to obtain a variance estimator other than the default ols. For example, users can specify vce(robust) for robust variance estimates or vce(cluster clustvar) for cluster-robust variance estimates. To adjust the variance estimates for clustering within occupational categories:

sgmediation2 health, iv(edyrs) mv(income) cv(i.race i.woman age) vce(cluster occcat)
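The robust (heteroskedasticity-consistent) variant mentioned above follows the same pattern. This is a short sketch using the same example variables:

```stata
* Robust variance estimates for all three models
sgmediation2 health, iv(edyrs) mv(income) cv(i.race i.woman age) vce(robust)
```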

Factor syntax for control variables

As demonstrated in the example above, factor syntax is allowed in the list of control variables. This allows you to specify whether each control variable is treated as continuous or nominal.

Factor syntax is not allowed for the focal independent variable (IV) or the mediating variable (MV). This reflects a limitation of the method, not the command (see Section 7 below). IVs and MVs are limited to continuous or binary variables with this method.

(7) Limitations of the Sobel-Goodman/product of coefficients approach to mediation

There are many limitations to this approach to mediation (more than I discuss here). A few of note:

These limitations (and some others) were the motivation of my article "A General Framework for Comparing Predictions and Marginal Effects Across Models" (Mize, Doan, and Long 2019). See that article and the associated Stata files if you are interested.

(8) References

Aroian, L. A. (1947). The probability function of the product of two normally distributed variables. Annals of Mathematical Statistics, 18, 265-271.

Baron, R. M., & Kenny, D. A. (1986). The moderator–mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology, 51(6), 1173-1182.

Goodman, L. A. (1960). On the exact variance of products. Journal of the American Statistical Association, 55, 708–713.

Keele, L. (2015). Causal mediation analysis: warning! Assumptions ahead. American Journal of Evaluation, 36(4), 500-513.

MacKinnon, D. P., Lockwood, C. M., Hoffman, J. M., West, S. G., & Sheets, V. (2002). A comparison of methods to test mediation and other intervening variable effects. Psychological Methods, 7(1), 83-104.

Mize, T. D., Doan, L., & Long, J. S. (2019). A general framework for comparing predictions and marginal effects across models. Sociological Methodology, 49(1), 152-189.

Preacher, K. J., & Hayes, A. F. (2004). SPSS and SAS procedures for estimating indirect effects in simple mediation models. Behavior Research Methods, Instruments, & Computers, 36(4), 717-731.

Preacher, K. J., & Hayes, A. F. (2008). Asymptotic and resampling strategies for assessing and comparing indirect effects in multiple mediator models. Behavior Research Methods, 40(3), 879-891.

Sobel, M. E. (1982). Asymptotic confidence intervals for indirect effects in structural equation models. Sociological Methodology, 13, 290-312.

Zhao, X., Lynch Jr, J. G., & Chen, Q. (2010). Reconsidering Baron and Kenny: Myths and truths about mediation analysis. Journal of Consumer Research, 37(2), 197-206.