Estimating Average Causal Effect
The Average Causal Effect (ACE) is often the population-level inference of interest. The ACE is difficult to estimate because confounding variables pollute and bias our estimates of causal effects. Using causal models and inference, we can estimate the ACE with the do-operator. Let's see how we can do so with simulated data.
Causal model
Let’s define the following causal model.
\(C \sim \mathcal{N}(1, 1)\)
\(X \sim \mathcal{N}(2 + 3 C, 1)\)
\(Y \sim \mathcal{N}(0.5 + 2.5 C + 1.5 X, 1)\)
Where,
\(C\) is the confounder (of \(X\) and \(Y\)),
\(X\) is the cause, and
\(Y\) is the effect.
In estimating the ACE, we want to know the causal effect of \(X\) on \(Y\) controlling for \(C\). Below, we simulate data from this causal model.
[1]:
import numpy as np
import pandas as pd
np.random.seed(37)
Xy = pd.DataFrame() \
    .assign(C=np.random.normal(1, 1, 10_000)) \
    .assign(X=lambda d: np.random.normal(2 + 3 * d['C'], 1)) \
    .assign(Y=lambda d: np.random.normal(0.5 + 2.5 * d['C'] + 1.5 * d['X'], 1))
Xy.shape
[1]:
(10000, 3)
[2]:
Xy.describe()
[2]:
| | C | X | Y |
|---|---|---|---|
| count | 10000.000000 | 10000.000000 | 10000.000000 |
| mean | 1.001723 | 4.995999 | 10.503296 |
| std | 0.995339 | 3.158066 | 7.219975 |
| min | -3.158679 | -7.957580 | -21.029061 |
| 25% | 0.338421 | 2.884758 | 5.718008 |
| 50% | 1.003041 | 4.973037 | 10.493518 |
| 75% | 1.669959 | 7.148329 | 15.395528 |
| max | 4.479742 | 17.173675 | 37.909525 |
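As an aside, the confounding bias that motivates all of this can be seen directly. The sketch below (an illustrative check, not part of the original walkthrough) regenerates data from the same process and compares a naive regression of \(Y\) on \(X\) alone with one that adjusts for \(C\), using plain NumPy least squares.

```python
import numpy as np

np.random.seed(37)

n = 10_000
C = np.random.normal(1, 1, n)                      # confounder
X = np.random.normal(2 + 3 * C, 1)                 # cause
Y = np.random.normal(0.5 + 2.5 * C + 1.5 * X, 1)  # effect

ones = np.ones(n)

# Naive regression Y ~ X: ignores the confounder C, so the backdoor
# path X <- C -> Y inflates the coefficient well above the true 1.5.
naive, *_ = np.linalg.lstsq(np.column_stack([ones, X]), Y, rcond=None)

# Adjusted regression Y ~ X + C: conditioning on C blocks the backdoor
# path, so the X coefficient recovers the true causal effect.
adjusted, *_ = np.linalg.lstsq(np.column_stack([ones, X, C]), Y, rcond=None)

print(f'naive X coefficient:    {naive[1]:.3f}')     # biased upward
print(f'adjusted X coefficient: {adjusted[1]:.3f}')  # close to the true 1.5
```

The naive coefficient lands near 2.25 rather than 1.5, since the confounder's effect leaks into the estimate through the backdoor path.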
DoubleML ACE estimation
We can use DoubleML to estimate the causal effect of \(X\) on \(Y\) as follows.
[3]:
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from doubleml import DoubleMLData
from doubleml import DoubleMLPLR
dml_data = DoubleMLData(
Xy,
y_col='Y',
d_cols='X',
x_cols=['C']
)
learner = LinearRegression()
ml_l = clone(learner)
ml_m = clone(learner)
dml_model = DoubleMLPLR(dml_data, ml_l, ml_m)
dml_model.fit()
dml_model.summary
[3]:
| | coef | std err | t | P>\|t\| | 2.5 % | 97.5 % |
|---|---|---|---|---|---|---|
| X | 1.506062 | 0.009728 | 154.824303 | 0.0 | 1.486996 | 1.525128 |
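With purely linear learners, the partially linear model amounts to the classic residual-on-residual (Frisch-Waugh-Lovell) regression: residualize both \(Y\) and \(X\) on \(C\), then regress the residuals on each other. A sketch of that idea, assuming the same data-generating process as above (DoubleML additionally uses cross-fitting, which matters little with linear models on this data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(37)
n = 10_000
C = np.random.normal(1, 1, n).reshape(-1, 1)
X = np.random.normal(2 + 3 * C[:, 0], 1)
Y = np.random.normal(0.5 + 2.5 * C[:, 0] + 1.5 * X, 1)

# Step 1: residualize the outcome Y on the confounder C
Y_res = Y - LinearRegression().fit(C, Y).predict(C)

# Step 2: residualize the treatment X on the confounder C
X_res = X - LinearRegression().fit(C, X).predict(C)

# Step 3: regress the Y residuals on the X residuals
theta = LinearRegression().fit(X_res.reshape(-1, 1), Y_res).coef_[0]
print(f'residual-on-residual estimate: {theta:.3f}')  # close to the true 1.5
```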
We can also try to estimate the causal effect of \(C\) on \(Y\).
[4]:
dml_data = DoubleMLData(
Xy,
y_col='Y',
d_cols='C',
x_cols=['X']
)
learner = LinearRegression()
ml_l = clone(learner)
ml_m = clone(learner)
dml_model = DoubleMLPLR(dml_data, ml_l, ml_m)
dml_model.fit()
dml_model.summary
[4]:
| | coef | std err | t | P>\|t\| | 2.5 % | 97.5 % |
|---|---|---|---|---|---|---|
| C | 2.492434 | 0.030878 | 80.719026 | 0.0 | 2.431915 | 2.552954 |
The ACEs (coefficients) estimated by DoubleML closely match the true coefficients used to simulate the data: 1.5 for \(X\) and 2.5 for \(C\).
Py-SCM ACE estimation
ACE estimation with py-scm is also possible using the iquery() method. However, instead of estimating a coefficient, we estimate the interventional expectation.
\(E[Y | \mathrm{do}(X=x)]\)
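Since our model is linear-Gaussian, this expectation has a closed form via the backdoor adjustment formula. Using the true simulation coefficients,

\(E[Y \mid \mathrm{do}(X=x)] = E_C\left[E[Y \mid X=x, C]\right] = 0.5 + 2.5\,E[C] + 1.5x = 0.5 + 2.5(1) + 1.5x = 3 + 1.5x\)

so we should expect \(E[Y \mid \mathrm{do}(X=2)] \approx 6\) and \(E[Y \mid \mathrm{do}(X=-1)] \approx 1.5\) from the estimates that follow.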
Let’s create our model.
[5]:
from pyscm.reasoning import create_reasoning_model
d = {
'nodes': ['C', 'X', 'Y'],
'edges': [
('C', 'X'),
('C', 'Y'),
('X', 'Y')
]
}
p = {
'v': Xy.columns.tolist(),
'm': Xy.mean().values,
'S': Xy.cov().values
}
model = create_reasoning_model(d, p)
Now, let’s estimate the causal effects on \(Y\) when we \(\mathrm{do}(X=2)\), namely, \(E[Y | \mathrm{do}(X=2)]\).
[6]:
model.iquery('Y', {'X': 2.0})
[6]:
mean 6.036492
std 2.389106
dtype: float64
Next, let's estimate the causal effect on \(Y\) when we \(\mathrm{do}(X=-1)\), namely, \(E[Y|\mathrm{do}(X=-1)]\).
[7]:
model.iquery('Y', {'X': -1.0})
[7]:
mean 1.536078
std 2.389106
dtype: float64
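These iquery() results can be cross-checked with a simple plug-in version of the adjustment formula: fit a linear regression of \(Y\) on \(X\) and \(C\), then average the predictions with \(X\) pinned at the intervention value. A minimal sketch, assuming the same data-generating process as above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(37)
n = 10_000
C = np.random.normal(1, 1, n)
X = np.random.normal(2 + 3 * C, 1)
Y = np.random.normal(0.5 + 2.5 * C + 1.5 * X, 1)

# Fit the outcome model Y ~ X + C
model = LinearRegression().fit(np.column_stack([X, C]), Y)

def plug_in_mean(x):
    """Estimate E[Y | do(X=x)] by fixing X at x and averaging
    predictions over the empirical distribution of C."""
    Xc = np.column_stack([np.full(n, x), C])
    return model.predict(Xc).mean()

print(f'E[Y | do(X=2)]  ~ {plug_in_mean(2.0):.3f}')   # roughly 6
print(f'E[Y | do(X=-1)] ~ {plug_in_mean(-1.0):.3f}')  # roughly 1.5
```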
The ACE is the difference.
\(E[Y | \mathrm{do}(X=2)] - E[Y|\mathrm{do}(X=-1)]\)
Since the two interventions differ by 3 units and the true coefficient of \(X\) is 1.5, we expect an ACE of \(3 \times 1.5 = 4.5\).
[8]:
model.iquery('Y', {'X': 2.0}) - model.iquery('Y', {'X': -1.0})
[8]:
mean 4.500414
std 0.000000
dtype: float64
Or, use the equery() method, which computes this difference directly.
[9]:
model.equery('Y', {'X': 2.0}, {'X': -1.0})
[9]:
mean 4.500414
std 0.000000
dtype: float64