Estimating Average Causal Effect
The Average Causal Effect (ACE) is often the population-level inference of interest. The ACE is difficult to estimate because confounding variables pollute and bias our estimates of causal effects. Using causal models and inference, we can estimate the ACE with the do-operator. Let's see how we can do so with simulated data.
Causal model
Let’s define the following causal model.
\(C \sim \mathcal{N}(1, 1)\)
\(X \sim \mathcal{N}(2 + 3 C, 1)\)
\(Y \sim \mathcal{N}(0.5 + 2.5 C + 1.5 X, 1)\)
Where,
\(C\) is the confounder (of \(X\) and \(Y\)),
\(X\) is the cause, and
\(Y\) is the effect.
In estimating the ACE, we want to know the causal effect of \(X\) on \(Y\) controlling for \(C\). Below, we simulate data from this causal model.
[1]:
import numpy as np
import pandas as pd
np.random.seed(37)
Xy = pd.DataFrame() \
    .assign(C=np.random.normal(1, 1, 10_000)) \
    .assign(X=lambda d: np.random.normal(2 + 3 * d['C'], 1)) \
    .assign(Y=lambda d: np.random.normal(0.5 + 2.5 * d['C'] + 1.5 * d['X'], 1))
Xy.shape
[1]:
(10000, 3)
[2]:
Xy.describe()
[2]:
| | C | X | Y |
|---|---|---|---|
| count | 10000.000000 | 10000.000000 | 10000.000000 |
| mean | 1.001723 | 4.995999 | 10.503296 |
| std | 0.995339 | 3.158066 | 7.219975 |
| min | -3.158679 | -7.957580 | -21.029061 |
| 25% | 0.338421 | 2.884758 | 5.718008 |
| 50% | 1.003041 | 4.973037 | 10.493518 |
| 75% | 1.669959 | 7.148329 | 15.395528 |
| max | 4.479742 | 17.173675 | 37.909525 |
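As an aside, the confounding bias that motivates all of this can be seen directly. The sketch below (an illustrative check, not part of the original walkthrough) regenerates data from the same process and compares a naive regression of \(Y\) on \(X\) alone with one that adjusts for \(C\), using plain NumPy least squares.

```python
import numpy as np

np.random.seed(37)

n = 10_000
C = np.random.normal(1, 1, n)                      # confounder
X = np.random.normal(2 + 3 * C, 1)                 # cause
Y = np.random.normal(0.5 + 2.5 * C + 1.5 * X, 1)  # effect

ones = np.ones(n)

# Naive regression Y ~ X: ignores the confounder C, so the backdoor
# path X <- C -> Y inflates the coefficient well above the true 1.5.
naive, *_ = np.linalg.lstsq(np.column_stack([ones, X]), Y, rcond=None)

# Adjusted regression Y ~ X + C: conditioning on C blocks the backdoor
# path, so the X coefficient recovers the true causal effect.
adjusted, *_ = np.linalg.lstsq(np.column_stack([ones, X, C]), Y, rcond=None)

print(f'naive X coefficient:    {naive[1]:.3f}')     # biased upward
print(f'adjusted X coefficient: {adjusted[1]:.3f}')  # close to the true 1.5
```

The naive coefficient lands near 2.25 rather than 1.5, since the confounder's effect leaks into the estimate through the backdoor path.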
DoubleML ACE estimation
We can use DoubleML to estimate the causal effect of \(X\) on \(Y\) as follows.
[3]:
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from doubleml import DoubleMLData
from doubleml import DoubleMLPLR
dml_data = DoubleMLData(
Xy,
y_col='Y',
d_cols='X',
x_cols=['C']
)
learner = LinearRegression()
ml_l = clone(learner)
ml_m = clone(learner)
dml_model = DoubleMLPLR(dml_data, ml_l, ml_m)
dml_model.fit()
dml_model.summary
[3]:
| | coef | std err | t | P>\|t\| | 2.5 % | 97.5 % |
|---|---|---|---|---|---|---|
| X | 1.506062 | 0.009728 | 154.824303 | 0.0 | 1.486996 | 1.525128 |
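With purely linear learners, the partially linear model amounts to the classic residual-on-residual (Frisch-Waugh-Lovell) regression: residualize both \(Y\) and \(X\) on \(C\), then regress the residuals on each other. A sketch of that idea, assuming the same data-generating process as above (DoubleML additionally uses cross-fitting, which matters little with linear models on this data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(37)
n = 10_000
C = np.random.normal(1, 1, n).reshape(-1, 1)
X = np.random.normal(2 + 3 * C[:, 0], 1)
Y = np.random.normal(0.5 + 2.5 * C[:, 0] + 1.5 * X, 1)

# Step 1: residualize the outcome Y on the confounder C
Y_res = Y - LinearRegression().fit(C, Y).predict(C)

# Step 2: residualize the treatment X on the confounder C
X_res = X - LinearRegression().fit(C, X).predict(C)

# Step 3: regress the Y residuals on the X residuals
theta = LinearRegression().fit(X_res.reshape(-1, 1), Y_res).coef_[0]
print(f'residual-on-residual estimate: {theta:.3f}')  # close to the true 1.5
```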
We can also try to estimate the causal effect of \(C\) on \(Y\).
[4]:
dml_data = DoubleMLData(
Xy,
y_col='Y',
d_cols='C',
x_cols=['X']
)
learner = LinearRegression()
ml_l = clone(learner)
ml_m = clone(learner)
dml_model = DoubleMLPLR(dml_data, ml_l, ml_m)
dml_model.fit()
dml_model.summary
[4]:
| | coef | std err | t | P>\|t\| | 2.5 % | 97.5 % |
|---|---|---|---|---|---|---|
| C | 2.492434 | 0.030878 | 80.719026 | 0.0 | 2.431915 | 2.552954 |
The ACEs (coefficients) estimated by DoubleML closely match the true coefficients used to simulate the data: 1.5 for \(X\) and 2.5 for \(C\).
Py-SCM ACE estimation
ACE estimation with py-scm is also possible using the iquery() method. However, instead of estimating a coefficient, we estimate the interventional expectation.
\(E[Y | \mathrm{do}(X=x)]\)
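Since our model is linear-Gaussian, this expectation has a closed form via the backdoor adjustment formula. Using the true simulation coefficients,

\(E[Y \mid \mathrm{do}(X=x)] = E_C\left[E[Y \mid X=x, C]\right] = 0.5 + 2.5\,E[C] + 1.5x = 0.5 + 2.5(1) + 1.5x = 3 + 1.5x\)

so we should expect \(E[Y \mid \mathrm{do}(X=2)] \approx 6\) and \(E[Y \mid \mathrm{do}(X=-1)] \approx 1.5\) from the estimates that follow.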
Let’s create our model.
[5]:
from pyscm.reasoning import create_reasoning_model
d = {
'nodes': ['C', 'X', 'Y'],
'edges': [
('C', 'X'),
('C', 'Y'),
('X', 'Y')
]
}
p = {
'v': Xy.columns.tolist(),
'm': Xy.mean().values,
'S': Xy.cov().values
}
model = create_reasoning_model(d, p)
Now, let’s estimate the causal effects on \(Y\) when we \(\mathrm{do}(X=2)\), namely, \(E[Y | \mathrm{do}(X=2)]\).
[6]:
model.iquery('Y', {'X': 2.0})
[6]:
mean 6.036492
std 2.389106
dtype: float64
Next, let's estimate the causal effect on \(Y\) when we \(\mathrm{do}(X=-1)\), namely, \(E[Y|\mathrm{do}(X=-1)]\).
[7]:
model.iquery('Y', {'X': -1.0})
[7]:
mean 1.536078
std 2.389106
dtype: float64
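These iquery() results can be cross-checked with a simple plug-in version of the adjustment formula: fit a linear regression of \(Y\) on \(X\) and \(C\), then average the predictions with \(X\) pinned at the intervention value. A minimal sketch, assuming the same data-generating process as above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(37)
n = 10_000
C = np.random.normal(1, 1, n)
X = np.random.normal(2 + 3 * C, 1)
Y = np.random.normal(0.5 + 2.5 * C + 1.5 * X, 1)

# Fit the outcome model Y ~ X + C
model = LinearRegression().fit(np.column_stack([X, C]), Y)

def plug_in_mean(x):
    """Estimate E[Y | do(X=x)] by fixing X at x and averaging
    predictions over the empirical distribution of C."""
    Xc = np.column_stack([np.full(n, x), C])
    return model.predict(Xc).mean()

print(f'E[Y | do(X=2)]  ~ {plug_in_mean(2.0):.3f}')   # roughly 6
print(f'E[Y | do(X=-1)] ~ {plug_in_mean(-1.0):.3f}')  # roughly 1.5
```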
The ACE is the difference.
\(E[Y | \mathrm{do}(X=2)] - E[Y|\mathrm{do}(X=-1)]\)
Since the two interventions differ by 3 units and the true coefficient of \(X\) is 1.5, we expect an ACE of \(3 \times 1.5 = 4.5\).
[8]:
model.iquery('Y', {'X': 2.0}) - model.iquery('Y', {'X': -1.0})
[8]:
mean 4.500414
std 0.000000
dtype: float64
Or, use the equery() method, which computes this difference directly.
[9]:
model.equery('Y', {'X': 2.0}, {'X': -1.0})
[9]:
mean 4.500414
std 0.000000
dtype: float64