Quasi Stuff - Difference in Differences

Stats
Inference
Author

Yiğit Aşık

Published

September 7, 2025

Introduction to Difference in Differences (DiD)

In a classic causal inference setting, what we are ultimately interested in is \(\text{E}[Y_1 - Y_0 | \text{D}=1]\), where D is the treatment indicator.

When we have an intervention (e.g., a policy change or a marketing campaign), what we end up with is a time period before and a time period after the intervention. Hence, the average treatment effect on the treated (ATET): \(\hat{\text{ATET}} = \text{E}[Y_1(1) - Y_0(1) | \text{D}=1]\) where \(Y_D(T)\) is the potential outcome under treatment D at time T. To make the word “potential” a bit clearer: for any individual we observe either the treated or the untreated outcome, never both at once, since we can’t turn back time and do it the other way. Okay, now that we got that out of the way, let’s continue.

You may have thought of different ways to estimate this ATET, like a before-vs-after comparison: \(\hat{\text{ATET}} = \text{E}[Y(1) | \text{D}=1] - \text{E}[Y(0) | \text{D}=1]\)

Well, this one assumes \(\text{E}[Y_0(1) | \text{D}=1] = \text{E}[Y_0(0) | \text{D}=1]\). In plain English: in the absence of the intervention, Y would have been the same before and after. This might sound reasonable at first, but imagine a line plot with any kind of trend (positive or negative). Either way it’s problematic, since there’s some temporal effect on Y: cutting the plot vertically at any point and comparing the two sides wouldn’t yield the same y values.

One other approach is \(\hat{\text{ATET}} = \text{E}[Y(1) | \text{D}=1] - \text{E}[Y(1) | \text{D}=0]\), i.e., comparing the treated group to the untreated group. Here we assume that the counterfactual can be replaced: \(\text{E}[Y_0(1) | \text{D}=0] = \text{E}[Y_0(1) | \text{D}=1]\). This requires the groups to have similar baseline levels, which is sometimes pursued through stratification or propensity score matching (both have their own issues).

DiD solves this by:

\(\displaystyle \text{E}[Y_0(1) | \text{D}=1] = \text{E}[Y_0(0) | \text{D}=1] + (\text{E}[Y_0(1) | \text{D=0}] - \text{E}[Y_0(0) | \text{D}=0])\)

It states that the counterfactual, i.e., how the treated group would have fared after the intervention had it not been treated, equals the treated group before the treatment plus the trend of the control group. Hence another assumption: the trends are the same for the treatment and the control. \(\text{E}[Y_0(1) - Y_0(0) | \text{D}=1] = \text{E}[Y_0(1) - Y_0(0) | \text{D}=0]\)

Alright, let’s put everything together then (alongside some algebraic arrangements):

\(\displaystyle \hat{\text{ATET}} = (\text{E}[Y(1) | \text{D}=1] - \text{E}[Y(1) | \text{D}=0]) - (\text{E}[Y(0) | \text{D}=1] - \text{E}[Y(0) | \text{D}=0])\)
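To spell out the algebra behind that line: the ATET is the observed post-treatment outcome of the treated minus its counterfactual,

\(\displaystyle \hat{\text{ATET}} = \text{E}[Y(1) | \text{D}=1] - \text{E}[Y_0(1) | \text{D}=1]\)

and substituting the parallel-trends expression above for \(\text{E}[Y_0(1) | \text{D}=1]\), while noting that pre-treatment and control-group outcomes are observed (e.g. \(\text{E}[Y_0(0) | \text{D}=1] = \text{E}[Y(0) | \text{D}=1]\)), gives

\(\displaystyle \hat{\text{ATET}} = \text{E}[Y(1) | \text{D}=1] - \text{E}[Y(0) | \text{D}=1] - (\text{E}[Y(1) | \text{D}=0] - \text{E}[Y(0) | \text{D}=0])\)

which rearranges into the grouping above.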

Thus the name difference in differences: we take the difference between treatment and control after the intervention and subtract the difference between them before the intervention.
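Before the real data, here’s a minimal simulated sketch (all numbers and names are invented for illustration): both groups share a +0.3 trend over time, the treated group starts from a lower baseline, and the true effect is -0.5. The two naive comparisons miss it, while the double difference recovers it.

import numpy as np

rng = np.random.default_rng(42)
n = 100_000

y0_control = rng.normal(4.0, 1, n)              # control, before
y1_control = rng.normal(4.0 + 0.3, 1, n)        # control, after: common trend only
y0_treated = rng.normal(3.8, 1, n)              # treated, before: lower baseline
y1_treated = rng.normal(3.8 + 0.3 - 0.5, 1, n)  # treated, after: trend + true effect

print(y1_treated.mean() - y0_treated.mean())    # before vs after: ~ -0.2, contaminated by the trend
print(y1_treated.mean() - y1_control.mean())    # treated vs control: ~ -0.7, contaminated by the baseline gap
print((y1_treated.mean() - y1_control.mean()) - (y0_treated.mean() - y0_control.mean()))  # DiD: ~ -0.5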

Let’s see it in action.

import pandas as pd
import numpy as np

import duckdb as db

from matplotlib import pyplot as plt

import statsmodels.api as sm
import statsmodels.formula.api as smf
from linearmodels.panel import PanelOLS

import warnings

warnings.filterwarnings('ignore')

pd.set_option('display.float_format', lambda x: '%.3f' % x)
# amazon = pd.read_excel('amazon.xlsx')
amazon = db.query("SELECT * FROM read_csv_auto('am_did.csv', IGNORE_ERRORS=TRUE)").to_df()
goodreads = db.query("SELECT * FROM read_csv_auto('gr_did.csv', IGNORE_ERRORS=TRUE)").to_df()
amazon.drop(['column0', 'reviewerID', 'reviewText', 'summary'], axis=1, inplace=True)
goodreads.drop(['column0', 'book_id', 'review_id', 'review_text', 'date_updated', 'user_id'], axis=1, inplace=True)

We have Amazon and Goodreads data. Goodreads added a new feature on May 21, 2014, which allows users to post questions to others (including authors). Let’s estimate this feature’s effect on book ratings.

amazon['date'] = pd.to_datetime(amazon['unixReviewTime'], unit='s')  # assuming unixReviewTime is seconds since the epoch
goodreads['date'] = pd.to_datetime(goodreads['date'])
first_date = pd.to_datetime('1996-11-14')
# Months elapsed since the first review date
amazon['t'] = (amazon['date'].dt.to_period('M') - first_date.to_period('M')).apply(lambda x: x.n)
goodreads['t'] = (goodreads['date'].dt.to_period('M') - first_date.to_period('M')).apply(lambda x: x.n)
goodreads.loc[goodreads['year_month'] == 'May 2014', :].head()
rating date year_month asin t
105 5 2014-05-26 May 2014 1855496593 210
531 5 2014-05-27 May 2014 1855496593 210
733 5 2014-05-06 May 2014 1855496593 210
972 5 2014-05-12 May 2014 1855496593 210
977 5 2014-05-10 May 2014 1855496593 210
amazon['goodr'] = 0
goodreads['goodr'] = 1
df = pd.concat(
    [amazon[['goodr', 'year_month', 'date', 'rating', 't', 'asin']], goodreads[['goodr', 'year_month', 'date', 'rating', 't', 'asin']]],
    ignore_index=True
)
df.head()
goodr year_month date rating t asin
0 0 Jan 2007 2007-01-03 4 122 000100039X
1 0 Apr 2015 2015-04-16 5 221 000100039X
2 0 Apr 2015 2015-04-08 5 221 000100039X
3 0 Jul 2013 2013-07-03 5 200 000100039X
4 0 Nov 2016 2016-11-13 5 240 000100039X
df['qa'] = (df['date'] >= pd.to_datetime('2014-05-21')).astype(int)
df.head()
goodr year_month date rating t asin qa
0 0 Jan 2007 2007-01-03 4 122 000100039X 0
1 0 Apr 2015 2015-04-16 5 221 000100039X 1
2 0 Apr 2015 2015-04-08 5 221 000100039X 1
3 0 Jul 2013 2013-07-03 5 200 000100039X 0
4 0 Nov 2016 2016-11-13 5 240 000100039X 1
df.groupby(['goodr', 'qa']).agg(
    rating_mean = ('rating', np.mean)
).reset_index()
goodr qa rating_mean
0 0 0 4.255
1 0 1 4.361
2 1 0 3.791
3 1 1 3.801

And, here’s the diff-in-diffs.

(3.801 - 3.791) - (4.361 - 4.255)
-0.09599999999999964
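The same number can be computed straight from the grouped means instead of copying them by hand, for example:

means = df.groupby(['goodr', 'qa'])['rating'].mean()
# (treated after - treated before) - (control after - control before)
(means.loc[(1, 1)] - means.loc[(1, 0)]) - (means.loc[(0, 1)] - means.loc[(0, 0)])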

Let’s check the parallel trends assumption.

goodr_agg = goodreads.groupby('t').agg(
    avg_rating = ('rating', np.mean)
).reset_index()

amaz_agg = amazon.groupby('t').agg(
    avg_rating = ('rating', np.mean)
).reset_index()

goodr_agg.head()
t avg_rating
0 125 3.832
1 126 3.788
2 127 3.706
3 128 3.755
4 129 3.776
plt.figure(figsize=(7, 5))

plt.plot(goodr_agg.t, goodr_agg.avg_rating, label='goodr')
plt.plot(amaz_agg.t, amaz_agg.avg_rating, label='amazon')

# Vertical dashed line at the Q&A release month (t = 210)
plt.plot(np.repeat(210, 201), np.linspace(3, 5, 201), ls='--', color='black', label='QA release', alpha=.57)

plt.ylim(3.5, 4.5)

plt.legend()

The two series look pretty similar prior to the treatment, so the parallel trends assumption seems plausible (DiD would be biased otherwise).
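As a rough complement to the eyeball check (a sketch, not a formal test), we can restrict the data to the pre-release period and let a regression ask whether the two platforms’ monthly trends differ; an interaction term close to zero is consistent with, though doesn’t prove, parallel pre-trends. The cutoff date and the linear-in-t specification are my choices here.

# Pre-release observations only
pre = df[df['date'] < pd.to_datetime('2014-05-21')]

# The t : goodr interaction captures the difference in monthly trends between platforms
pre_trend = smf.ols('rating ~ t * C(goodr)', data=pre).fit()
print(pre_trend.params)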

Well, do we just conclude there’s a negative effect? Not really. We would like to control for other variables, which is possible via regression. On top of that, we get standard errors, which help quantify the uncertainty surrounding the estimate.

lm = smf.ols('rating ~ C(qa)*C(goodr)', data=df)
res = lm.fit()
res.summary()
OLS Regression Results
Dep. Variable: rating R-squared: 0.045
Model: OLS Adj. R-squared: 0.045
Method: Least Squares F-statistic: 2.295e+04
Date: Sun, 07 Sep 2025 Prob (F-statistic): 0.00
Time: 21:47:22 Log-Likelihood: -2.2720e+06
No. Observations: 1469344 AIC: 4.544e+06
Df Residuals: 1469340 BIC: 4.544e+06
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 4.2552 0.002 2275.194 0.000 4.252 4.259
C(qa)[T.1] 0.1059 0.002 45.324 0.000 0.101 0.110
C(goodr)[T.1] -0.4646 0.003 -155.565 0.000 -0.470 -0.459
C(qa)[T.1]:C(goodr)[T.1] -0.0954 0.004 -23.027 0.000 -0.104 -0.087
Omnibus: 364326.390 Durbin-Watson: 1.825
Prob(Omnibus): 0.000 Jarque-Bera (JB): 740463.435
Skew: -1.501 Prob(JB): 0.00
Kurtosis: 4.757 Cond. No. 6.81


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
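The interaction coefficient is the DiD estimate: it matches the manual double difference from above (about -0.095 vs. -0.096) and now comes with a standard error. It can be pulled out of the fitted result directly, for example:

term = 'C(qa)[T.1]:C(goodr)[T.1]'
res.params[term], res.conf_int().loc[term]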

We can also add fixed effects. Different model specifications can produce different estimates, so it’s important to choose the right type of model. Here, for convenience, I go with a single fixed effect (on year_month).

df['year_month'] = df['year_month'].astype('category')
df['goodr_qa'] = df['goodr'] * df['qa']

# Panel index: year_month is the "entity" whose effects get absorbed,
# the original row number serves as the second index level
df.set_index(['year_month', df.index], inplace=True)

exog = sm.add_constant(df[['goodr', 'qa', 'goodr_qa']])
fe_model = PanelOLS(df['rating'], exog, entity_effects=True).fit()
fe_model.summary
PanelOLS Estimation Summary
Dep. Variable: rating R-squared: 0.0375
Estimator: PanelOLS R-squared (Between): 0.1461
No. Observations: 1469344 R-squared (Within): 0.0375
Date: Sun, Sep 07 2025 R-squared (Overall): 0.0433
Time: 21:47:23 Log-likelihood -2.27e+06
Cov. Estimator: Unadjusted
F-statistic: 1.908e+04
Entities: 258 P-value 0.0000
Avg Obs: 5695.1 Distribution: F(3,1469083)
Min Obs: 1.0000
Max Obs: 2.768e+04 F-statistic (robust): 1.908e+04
P-value 0.0000
Time periods: 1469344 Distribution: F(3,1469083)
Avg Obs: 1.0000
Min Obs: 1.0000
Max Obs: 1.0000
Parameter Estimates
Parameter Std. Err. T-stat P-value Lower CI Upper CI
const 4.3027 0.0106 406.99 0.0000 4.2820 4.3235
goodr -0.4520 0.0032 -142.57 0.0000 -0.4582 -0.4458
qa 0.0212 0.0179 1.1876 0.2350 -0.0138 0.0563
goodr_qa -0.1070 0.0043 -24.868 0.0000 -0.1155 -0.0986


F-test for Poolability: 19.050
P-value: 0.0000
Distribution: F(257,1469083)

Included effects: Entity

You can see that the results differ across specifications. What else can be done? We could add two-way fixed effects, include the books themselves in the model, and cluster the standard errors, since observations within a book are more similar to each other than to observations from other books, which would yield more appropriate standard error estimates.
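As a hedged sketch of the clustering idea (not a definitive specification), the same interaction model can be refit with standard errors clustered by book via statsmodels’ cluster-robust option; the point estimate stays the same and only the uncertainty changes. The throwaway copy dfr is just for illustration, and the commented formula hints at how month (and, memory permitting, book) dummies could be added for a two-way fixed effects variant.

# Bring year_month back out of the index so the DataFrame is flat again
dfr = df.reset_index(level='year_month')

# Same specification as the first regression, with book-level clustered standard errors
cl = smf.ols('rating ~ C(qa) * C(goodr)', data=dfr).fit(
    cov_type='cluster', cov_kwds={'groups': dfr['asin']}
)
print(cl.bse)  # compare with the nonrobust standard errors above

# A two-way fixed effects variant could look like this (heavier to estimate):
# smf.ols('rating ~ goodr_qa + C(goodr) + C(year_month)', data=dfr)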

I’m not going to dive in further for now, since I haven’t discussed these topics explicitly on the blog (yet). However, if you’re interested, you can take a look at the following:

Tilburg Science Hub - DiD