If two traits, A and B, are both associated with a third trait C, we can use mediation analysis to test whether A affects C through B, or the other way round. This post walks through a simple mediation analysis.
Theoretical knowledge
Principle
In epidemiological research, mediation analysis helps reveal how an exposure influences an outcome. It is typically used to evaluate the extent to which the effect of an exposure is explained by one or more hypothesized mediating variables (also called intermediate variables). In this framework, the total effect of the exposure on the outcome is decomposed into the part that acts through a given set of mediators (the indirect effect) and the part that cannot be explained by those same mediators (the direct effect).
In general, the part of the total effect that can be explained by the mediators is the indirect effect, and the part that cannot is the direct effect, so the total effect decomposes into the direct plus the indirect effect. For example, if the exposure carries a total risk of 15%, of which 10% is the direct effect and 5% is the indirect effect, then one third of the total effect is explained by the mediator and the other two thirds act through other pathways.
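As a minimal sketch of this arithmetic, using the hypothetical numbers above:

total_effect    = 0.15                           # hypothetical total effect of the exposure
direct_effect   = 0.10                           # part not explained by the mediator
indirect_effect = total_effect - direct_effect   # 0.05, part explained by the mediator
indirect_effect / total_effect                   # proportion mediated: one third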
[scode type="yellow"]
Generally, a mediation effect is only tested when the independent variable has an effect on the dependent variable (i.e. there is a total effect). However, the absence of a total effect does not mean there is no intermediate variable; in that case the effect transmitted through the mediator is called an indirect effect. A mediation effect is always an indirect effect, but an indirect effect is not necessarily a mediation effect. In practice, the effects from the independent variable to the mediator and from the mediator to the dependent variable may have opposite signs, so that the total effect is close to zero (for example, an indirect effect of +0.2 combined with a direct effect of -0.2 gives a total effect of 0). This situation is called effect masking (suppression), i.e. a mediation effect in the generalized sense.
[/scode]
There are many reviews of mediation analysis; two articles, one in Chinese and one in English, are listed here for reference:
Estimation of mediating effect
The calculation of the mediation effect is quite simple and can be summarized as follows:
- First, estimate the total effect, i.e. the overall effect of the independent variable on the dependent variable: \(Y = cX + \epsilon_1\)
- Then estimate the effect of the independent variable on the intermediate (mediator) variable: \(M = aX + \epsilon_2\)
- Finally, estimate the direct effect, i.e. the effect of the independent variable on the dependent variable after adjusting for the mediator: \(Y = c'X + bM + \epsilon_3\)
- The total effect is then \(c\), the direct effect is \(c'\), the mediation effect is \(c - c'\), and the proportion mediated is \(\frac{c - c'}{c}\) (a minimal R sketch of these steps follows this list)
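As a small illustration, and assuming a data frame df with columns X (independent variable), M (mediator) and Y (dependent variable), these three steps can be sketched in R as follows (the full worked example comes later):

# Sketch: difference-of-coefficients estimation of the mediation effect
c_total  = coefficients(lm(Y ~ X, df))["X"]      # total effect c
a        = coefficients(lm(M ~ X, df))["X"]      # effect a of the independent variable on the mediator
c_direct = coefficients(lm(Y ~ X + M, df))["X"]  # direct effect c'
c_total - c_direct                               # mediation effect c - c'
(c_total - c_direct) / c_total                   # proportion mediated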
Test of mediating effect
Compared with estimating the mediation effect, testing it is slightly more complex; the Sobel test is commonly used.
The parameters used in the test below are the path coefficients \(a\) and \(b\) (and their standard errors) from the regression equations above.
Practice of mediation analysis
Having covered the principle of mediation analysis and the statistical tests involved, the following example illustrates the whole process.
Design simulation data
We use R's built-in iris data for a simulated study. Sepal length (Sepal.Length) is taken as the independent variable and renamed X; its attractiveness to bees is defined as the mediator, named M; and a dependent variable, assumed to represent the likelihood of being finally collected by bees, is named Y. To create some noise and a mediation effect, we define the following relationships between the three variables: \(M = 0.35X + 0.65e_1\) and \(Y = 0.35M + 0.65e_2\), where \(e_1\) and \(e_2\) are uniform random noise terms.

With this design, the expected mediation effect is \(0.35 \times 0.35 = 0.1225\) (12.25%), which we will now verify.
Create simulation data
df = iris
set.seed(12334)
colnames(df)[1] = "X"                                      # Sepal.Length becomes the independent variable X
df$e1 = runif(nrow(df), min = min(df$X), max = max(df$X))  # uniform noise for the mediator
df$M = df$X * 0.35 + df$e1 * 0.65                          # mediator M depends on X
df$e2 = runif(nrow(df), min = min(df$M), max = max(df$M))  # uniform noise for the outcome
df$Y = df$M * 0.35 + df$e2 * 0.65                          # outcome Y depends on X only through M
Test total effect
The test equation is \(Y = cX + \epsilon_1\).

The R code is:
fit_total = lm(Y ~ X, df)
summary(fit_total)
The output is as follows:
Call:
lm(formula = Y ~ X, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.15930 -0.45815 -0.01242  0.44662  1.20905 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.29106    0.32791  16.136   <2e-16 ***
X            0.12984    0.05557   2.337   0.0208 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5616 on 148 degrees of freedom
Multiple R-squared:  0.03558,   Adjusted R-squared:  0.02907 
F-statistic:  5.46 on 1 and 148 DF,  p-value: 0.02079
The estimated total effect is 0.12984, close to the expected 0.1225, with a p value of 0.0208. The first step is significant, so we proceed to the next step.
Test the effect of the independent variable on the mediator
The test equation is \(M = aX + \epsilon_2\).

The R code is:
fit_mediator = lm(M ~ X, df)
summary(fit_mediator)
The output is as follows:
Call:
lm(formula = M ~ X, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.2494 -0.5082  0.0123  0.5483  1.0799 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.32300    0.37901  11.406  < 2e-16 ***
X            0.30429    0.06422   4.738 5.02e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6492 on 148 degrees of freedom
Multiple R-squared:  0.1317,   Adjusted R-squared:  0.1258 
F-statistic: 22.45 on 1 and 148 DF,  p-value: 5.019e-06
The effect of the independent variable on the mediator is 0.30429, close to the expected 0.35, with a p value of 5.019e-06. The second step is also significant, so we proceed to the next step.
Test direct effect
The test equation is \(Y = c'X + bM + \epsilon_3\).

The R code is:
fit_direct = lm(Y ~ X + M, df)
summary(fit_direct)
The output is as follows:
Call:
lm(formula = Y ~ X + M, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.90734 -0.46316  0.00764  0.39751  0.87026 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.68314    0.40721   9.045 8.17e-16 ***
X            0.01667    0.05402   0.309    0.758    
M            0.37194    0.06443   5.773 4.46e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5088 on 147 degrees of freedom
Multiple R-squared:  0.2138,   Adjusted R-squared:  0.2031 
F-statistic: 19.99 on 2 and 147 DF,  p-value: 2.092e-08
The direct effect is only 0.01667, very weak, with a p value of 0.758 (not significant), while the effect of the mediator on the dependent variable is 0.37194, close to the expected 0.35, with a p value of 4.46e-08. In other words, after adjusting for the mediator, the effect of the independent variable on the dependent variable is no longer significant. This situation is called complete mediation.
Sobel test
When at least one of the coefficients \(a\) and \(b\) is not significant, the significance of the mediation effect can be tested with the following statistic. The method was proposed by Sobel, so it is called the Sobel test:

\[ z = \frac{\hat{a}\hat{b}}{\sqrt{\hat{a}^2 s_b^2 + \hat{b}^2 s_a^2}} \]
Here \(\hat{a}\) is the effect of the independent variable on the mediator, \(\hat{b}\) is the effect of the mediator on the dependent variable after adjusting for the independent variable, and \(s_a\) and \(s_b\) are the standard errors of \(\hat{a}\) and \(\hat{b}\), i.e. the Std. Error column in the outputs above. In R:
a = coefficients(fit_mediator)["X"]                          # a: effect of X on the mediator M
s_a = summary(fit_mediator)$coefficients["X", "Std. Error"]
b = coefficients(fit_direct)["M"]                            # b: effect of M on Y, adjusted for X
s_b = summary(fit_direct)$coefficients["M", "Std. Error"]
SE = sqrt(a^2 * s_b^2 + b^2 * s_a^2)                         # standard error of the product a*b
t_statistic = a * b / SE
# Degrees of freedom: n - k - 1, where k is the number of explanatory variables
# in the direct-effect model (here k = 2), so df = 147
p = 2 * pt(
  abs(t_statistic),
  df = df.residual(fit_direct),
  lower.tail = FALSE
)
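If the resulting p value is below the chosen significance level (for example 0.05), the mediation (indirect) effect \(\hat{a}\hat{b}\) is judged to be significantly different from zero.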
Summary
In the steps above, the total effect of the independent variable on the dependent variable is 0.12984, and both the effect of the independent variable on the mediator and the effect of the mediator on the dependent variable are close to the simulated 0.35. After adjusting for the mediator, the direct effect of the independent variable on the dependent variable becomes very small (0.01667) and insignificant. We can therefore say that the independent variable affects the dependent variable entirely through the mediator, which is called a complete mediation effect.
Since the total effect is 0.12984 and the direct effect is 0.01667, the mediation effect is 0.12984 - 0.01667 = 0.11317, which accounts for 87.16% of the total effect.
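These numbers can be reproduced directly from the models fitted above:

c_total  = coefficients(fit_total)["X"]    # total effect, 0.12984
c_direct = coefficients(fit_direct)["X"]   # direct effect, 0.01667
c_total - c_direct                         # mediation effect, about 0.11317
(c_total - c_direct) / c_total             # proportion mediated, about 0.8716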
Simplify operations with R packages
The steps above are a little cumbersome. Is there an R package that wraps up the whole process? Of course there is: the mediation package:
install.packages("mediation")
library(mediation)
results = mediate(
  model.m = fit_mediator,   # model for the mediator: M ~ X
  model.y = fit_direct,     # model for the outcome: Y ~ X + M
  treat = "X",
  mediator = "M",
  boot = TRUE               # nonparametric bootstrap confidence intervals
)
summary(results)
The results are as follows:
Causal Mediation Analysis 

Nonparametric Bootstrap Confidence Intervals with the Percentile Method

               Estimate 95% CI Lower 95% CI Upper p-value    
ACME             0.1132       0.0575         0.17  <2e-16 ***
ADE              0.0167      -0.0904         0.12   0.752    
Total Effect     0.1298       0.0152         0.24   0.018 *  
Prop. Mediated   0.8716       0.3796         3.68   0.018 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Sample Size Used: 150 

Simulations: 1000 
In this output, ACME is the average causal mediation effect, equal to \(\hat{a}\hat{b}\); ADE is the average direct effect; Total Effect is the total effect; and Prop. Mediated is the proportion of the effect that is mediated.
In general, the product of the effect \(\beta_1\) of the independent variable on the mediator and the effect \(\beta_2\) of the mediator on the dependent variable is taken as the average mediation effect.
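As a quick check, the ACME reported above equals the product of the two path coefficients from the models fitted earlier:

a_hat = coefficients(fit_mediator)["X"]    # effect of the independent variable on the mediator (0.30429)
b_hat = coefficients(fit_direct)["M"]      # effect of the mediator on the outcome, adjusted for X (0.37194)
a_hat * b_hat                              # about 0.1132, matching the ACME estimate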
Studies using mediation analysis
The following studies used mediation analysis:
| Data type | URL | Mediation method | Description |
|---|---|---|---|
| eQTL-meQTL | Nature Communications | Sobel test | For each co-located data pair, the association analysis is repeated and the mediation effect is tested |
| eQTL-pQTL | Nature | Mendelian randomization | Genetic variants are used as instrumental variables, plasma protein as the exposure (i.e. pQTL), and disease as the outcome |
| pQTL-eQTL | Nature Communications | Mendelian randomization | Independent GWAS loci (0.1) are used as instrumental variables, CHD as the outcome, and protein as the exposure (pQTL); significance for multiple independent loci and for single loci is assessed by multi-SNP MR and the Wald test respectively, using MRbase |