Estimating Average Causal Effect

The Average Causal Effect (ACE) is often the population-level quantity of interest. It is difficult to estimate from observational data because confounding variables bias naive estimates of causal effects. With a causal model and the do-operator, however, we can estimate the ACE. Let’s see how with simulated data.
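
Concretely, for two treatment levels \(x_1\) and \(x_0\), the ACE is the difference of interventional expectations,

\(\mathrm{ACE} = E[Y \mid \mathrm{do}(X=x_1)] - E[Y \mid \mathrm{do}(X=x_0)]\)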

Causal model

Let’s define the following causal model.

  • \(C \sim \mathcal{N}(1, 1)\)

  • \(X \sim \mathcal{N}(2 + 3 C, 1)\)

  • \(Y \sim \mathcal{N}(0.5 + 2.5 C + 1.5 X, 1)\)

Where,

  • \(C\) is the confounder (of \(X\) and \(Y\)),

  • \(X\) is the cause, and

  • \(Y\) is the effect.

In estimating the ACE, we want the causal effect of \(X\) on \(Y\) while controlling for \(C\). From the structural equations, the coefficient of \(X\) in the equation for \(Y\) is 1.5 and the coefficient of \(C\) is 2.5; these are the values we hope to recover. Below, we simulate data from this causal model.

[1]:
import numpy as np
import pandas as pd

np.random.seed(37)

# simulate 10,000 observations from the structural equations above
Xy = (
    pd.DataFrame()
    .assign(C=np.random.normal(1, 1, 10_000))                                      # C ~ N(1, 1)
    .assign(X=lambda d: np.random.normal(2 + 3 * d['C'], 1))                       # X ~ N(2 + 3C, 1)
    .assign(Y=lambda d: np.random.normal(0.5 + 2.5 * d['C'] + 1.5 * d['X'], 1))    # Y ~ N(0.5 + 2.5C + 1.5X, 1)
)

Xy.shape
[1]:
(10000, 3)
[2]:
Xy.describe()
[2]:
                  C             X             Y
count  10000.000000  10000.000000  10000.000000
mean       1.001723      4.995999     10.503296
std        0.995339      3.158066      7.219975
min       -3.158679     -7.957580    -21.029061
25%        0.338421      2.884758      5.718008
50%        1.003041      4.973037     10.493518
75%        1.669959      7.148329     15.395528
max        4.479742     17.173675     37.909525
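
These summary statistics line up with the marginals implied by the model: \(E[C] = 1\), \(E[X] = 2 + 3 \cdot 1 = 5\), and \(E[Y] = 0.5 + 2.5 \cdot 1 + 1.5 \cdot 5 = 10.5\).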

DoubleML ACE estimation

We can use DoubleML’s partially linear regression (PLR) model to estimate the causal effect of \(X\) on \(Y\) as follows. Since the true relationships are linear, a plain linear regression serves as the learner for both nuisance functions.

[3]:
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from doubleml import DoubleMLData
from doubleml import DoubleMLPLR

dml_data = DoubleMLData(
    Xy,
    y_col='Y',     # outcome
    d_cols='X',    # treatment (cause) of interest
    x_cols=['C']   # confounders to adjust for
)

learner = LinearRegression()
ml_l = clone(learner)  # nuisance learner for the outcome regression
ml_m = clone(learner)  # nuisance learner for the treatment regression

dml_model = DoubleMLPLR(dml_data, ml_l, ml_m)
dml_model.fit()
dml_model.summary
[3]:
       coef   std err           t  P>|t|     2.5 %    97.5 %
X  1.506062  0.009728  154.824303    0.0  1.486996  1.525128

We can also try to estimate the causal effect of \(C\) on \(Y\).

[4]:
dml_data = DoubleMLData(
    Xy,
    y_col='Y',
    d_cols='C',
    x_cols=['X']
)

learner = LinearRegression()
ml_l = clone(learner)
ml_m = clone(learner)

dml_model = DoubleMLPLR(dml_data, ml_l, ml_m)
dml_model.fit()
dml_model.summary
[4]:
       coef   std err          t  P>|t|     2.5 %    97.5 %
C  2.492434  0.030878  80.719026    0.0  2.431915  2.552954

The ACEs (coefficients) estimated by DoubleML match the coefficients used to simulate the data: roughly 1.5 for the effect of \(X\) on \(Y\) and 2.5 for the effect of \(C\) on \(Y\).
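
As a quick sanity check (not part of the DoubleML workflow, just plain OLS on the simulated data), the sketch below contrasts an unadjusted regression of \(Y\) on \(X\) with one that adjusts for \(C\). From the structural equations, the unadjusted slope should be about \(\mathrm{cov}(X, Y)/\mathrm{var}(X) = (2.5 \cdot 3 + 1.5 \cdot 10)/10 = 2.25\) rather than the causal 1.5, which is what makes the adjustment necessary.

from sklearn.linear_model import LinearRegression

# unadjusted vs adjusted OLS on the simulated data
naive = LinearRegression().fit(Xy[['X']], Xy['Y'])
adjusted = LinearRegression().fit(Xy[['X', 'C']], Xy['Y'])

print(naive.coef_)     # expect a slope near 2.25 -- biased by the confounder C
print(adjusted.coef_)  # expect slopes near [1.5, 2.5] -- the coefficients used in simulation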

Py-SCM ACE estimation

ACE estimation with py-scm is also possible using the iquery() method. However, we are not estimating a coefficient; instead, we estimate the interventional expectation

\(E[Y | \mathrm{do}(X=x)]\)
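
For the linear Gaussian model above, this has a closed form: intervening on \(X\) severs the edge \(C \rightarrow X\) while \(C\) keeps its marginal distribution, so \(E[Y \mid \mathrm{do}(X=x)] = 0.5 + 2.5\,E[C] + 1.5x = 3 + 1.5x\).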

Let’s create our model.

[5]:
from pyscm.reasoning import create_reasoning_model

# structure of the causal DAG
d = {
    'nodes': ['C', 'X', 'Y'],
    'edges': [
        ('C', 'X'),
        ('C', 'Y'),
        ('X', 'Y')
    ]
}

# parameters: variable names, empirical means, and empirical covariance matrix
p = {
    'v': Xy.columns.tolist(),
    'm': Xy.mean().values,
    'S': Xy.cov().values
}

model = create_reasoning_model(d, p)

Now, let’s estimate the expected value of \(Y\) under the intervention \(\mathrm{do}(X=2)\), namely, \(E[Y | \mathrm{do}(X=2)]\).

[6]:
model.iquery('Y', {'X': 2.0})
[6]:
mean    6.036492
std     2.389106
dtype: float64
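
This agrees with the closed form \(3 + 1.5 \cdot 2 = 6\).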

Next, let’s estimate the expected value of \(Y\) under the intervention \(\mathrm{do}(X=-1)\), namely, \(E[Y|\mathrm{do}(X=-1)]\).

[7]:
model.iquery('Y', {'X': -1.0})
[7]:
mean    1.536078
std     2.389106
dtype: float64

The ACE is the difference of these two interventional expectations,

\(E[Y | \mathrm{do}(X=2)] - E[Y|\mathrm{do}(X=-1)]\),

which under the linear model should be \(1.5 \times (2 - (-1)) = 4.5\).

[8]:
model.iquery('Y', {'X': 2.0}) - model.iquery('Y', {'X': -1.0})
[8]:
mean    4.500414
std     0.000000
dtype: float64

Alternatively, the equery() method computes this difference directly.

[9]:
model.equery('Y', {'X': 2.0}, {'X': -1.0})
[9]:
mean    4.500414
std     0.000000
dtype: float64
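
As a final sanity check (outside of py-scm), here is a minimal Monte Carlo sketch that approximates the same ACE by simulating the interventional distributions directly: \(X\) is held fixed at the do() value, \(C\) keeps its marginal distribution, and we difference the resulting means of \(Y\). The helper simulate_do_x below is just for illustration.

import numpy as np

def simulate_do_x(x, n=100_000, seed=37):
    """Approximate E[Y | do(X=x)] by simulating from the mutilated model."""
    rng = np.random.default_rng(seed)
    C = rng.normal(1, 1, n)                      # C keeps its marginal distribution
    Y = rng.normal(0.5 + 2.5 * C + 1.5 * x, 1)   # X is pinned at the do() value
    return Y.mean()

# difference of interventional means; expect a value close to 4.5
print(simulate_do_x(2.0) - simulate_do_x(-1.0))

This is consistent with the per-unit effect of 1.5 recovered by DoubleML, since \(1.5 \times (2 - (-1)) = 4.5\).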