Checking Independence in Linear Regression¶

Independence is a key assumption in linear regression that ensures the observations (and their residuals) are not correlated with one another. This assumption is critical for making valid inferences and obtaining accurate estimates of regression coefficients. If independence is violated, the model may produce biased results, underestimated standard errors, and incorrect significance tests. This section explores various methods to check for independence in linear regression, with a focus on identifying and addressing autocorrelation and other forms of dependence among observations.

1. Understanding the Independence Assumption¶

Definition: In the context of linear regression, the independence assumption means that the residuals (errors) of the model should be independent of each other. For time-series data, this implies that the residuals at one time point should not be correlated with the residuals at another time point. For cross-sectional data, this means that the residuals for one observation should not be correlated with the residuals for any other observation.

Formally, the assumption requires:

\[ \text{Cov}(\epsilon_i, \epsilon_j) = 0 \quad \text{for all } i \neq j \]

This can also be expressed in matrix form. Under the independence assumption, the variance-covariance matrix of the errors is diagonal:

\[ \text{Var}(\boldsymbol{\epsilon}) = \sigma^2 \mathbf{I}_n \]

Why It Matters: If the independence assumption is violated, it can lead to:

Autocorrelation: When residuals are correlated across time or sequence, often seen in time-series data. Positive autocorrelation means a positive residual at time \(t\) tends to be followed by a positive residual at time \(t+1\).
Clustered Errors: When observations within certain groups or clusters are more similar to each other than to observations in other groups.

2. Durbin-Watson Test for Autocorrelation¶

The Durbin-Watson (DW) test is a widely used statistical test to detect the presence of first-order autocorrelation in the residuals of a regression model. This test is particularly useful for time-series data.

Test Statistic:

The Durbin-Watson statistic is defined as:

\[ DW = \frac{\sum_{t=2}^{n}(e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2} \]

where \(e_t\) is the residual at time \(t\).

The DW statistic ranges between 0 and 4:

\(DW \approx 2\): No autocorrelation
\(DW \to 0\): Strong positive autocorrelation
\(DW \to 4\): Strong negative autocorrelation

Relationship to autocorrelation:

\[ DW \approx 2(1 - \hat{\rho}) \]

where \(\hat{\rho}\) is the estimated first-order autocorrelation coefficient of the residuals.

Steps:

Fit the Linear Regression Model: First, fit your linear regression model to obtain the residuals.
Calculate the Durbin-Watson Statistic: The DW statistic will typically range between 0 and 4.
Interpret the Statistic:
A DW value around 2 indicates no autocorrelation.
A value closer to 0 suggests positive autocorrelation.
A value closer to 4 suggests negative autocorrelation.

Example:

from statsmodels.stats.stattools import durbin_watson

# Assuming 'model' is your fitted OLS model
dw_stat = durbin_watson(model.resid)
print(f'Durbin-Watson statistic: {dw_stat}')

Interpretation Guidelines:

DW Value	Interpretation
\(DW \approx 2\)	No significant autocorrelation
\(DW < 1.5\)	Positive autocorrelation (violates independence)
\(DW > 2.5\)	Negative autocorrelation (violates independence)

3. Residual Plots for Detecting Patterns¶

A residual plot is another effective tool for checking the independence assumption. By plotting the residuals against time (in time-series data) or the order of data collection (in cross-sectional data), you can visually inspect for patterns that may indicate dependence.

Steps:

Plot Residuals: Create a plot of residuals against time or the sequence/order of observations.
Assess Patterns: Look for systematic patterns such as trends, cycles, or clusters in the plot.

Example:

import matplotlib.pyplot as plt

# Assuming 'model' is your fitted OLS model and 'X' is the time or order variable
plt.plot(X, model.resid)
plt.xlabel('Time or Sequence')
plt.ylabel('Residuals')
plt.title('Residuals vs. Time/Order')
plt.axhline(y=0, color='red', linestyle='--')
plt.show()

Interpretation:

No Clear Pattern: Random scatter around zero indicates independence.
Visible Pattern: Trends, cycles, or clusters suggest a violation of independence, indicating potential autocorrelation or another form of dependence.

Common patterns and their meaning:

Pattern	Likely Cause
Smooth waves	Seasonal or cyclical autocorrelation
Upward/downward trend	Missing trend variable in the model
Alternating signs	Negative autocorrelation
Runs of same sign	Positive autocorrelation

4. Breusch-Godfrey Test for Higher-Order Autocorrelation¶

The Breusch-Godfrey test is an extension of the Durbin-Watson test and is more versatile in detecting higher-order autocorrelation (not just the first-order). This test is useful when you suspect that the autocorrelation may extend beyond adjacent observations.

Hypotheses:

\(H_0\): No autocorrelation up to lag \(p\)
\(H_1\): Autocorrelation exists at some lag \(\leq p\)

The test regresses the residuals on the original predictors and lagged residuals:

\[ e_t = \alpha_0 + \alpha_1 X_{1t} + \cdots + \rho_1 e_{t-1} + \rho_2 e_{t-2} + \cdots + \rho_p e_{t-p} + u_t \]

The test statistic is \(nR^2\) from this auxiliary regression, which follows a \(\chi^2(p)\) distribution under \(H_0\).

Steps:

Fit the Linear Regression Model: Obtain the residuals from your fitted model.
Perform the Breusch-Godfrey Test: The test will provide a statistic and p-value.
Interpret the Results: A significant p-value (typically < 0.05) suggests the presence of autocorrelation.

Example:

from statsmodels.stats.diagnostic import acorr_breusch_godfrey

# Perform the Breusch-Godfrey test
bg_test = acorr_breusch_godfrey(model, nlags=2)
print(f'Breusch-Godfrey LM statistic: {bg_test[0]}')
print(f'Breusch-Godfrey p-value: {bg_test[1]}')

Interpretation:

p-value > 0.05: No significant autocorrelation detected.
p-value < 0.05: Significant autocorrelation present, indicating a violation of independence.

Advantages over Durbin-Watson:

Can detect higher-order autocorrelation (lag 2, 3, etc.)
Works with lagged dependent variables as regressors
More general and flexible

5. Examining Data Collection Process¶

Sometimes, dependence among observations arises from the data collection process itself, particularly in clustered or hierarchical data structures (e.g., students within schools, patients within hospitals). It is important to understand the structure of your data and check for potential clustering.

Steps:

Understand Data Structure: Identify if the data is collected in groups or clusters.
Check for Clustering: Use hierarchical or mixed-effects models if clustering is suspected, as standard linear regression may not account for the within-cluster correlation.

Common clustered data structures:

Structure	Level 1	Level 2	Example
Educational	Students	Schools	Test scores by school
Medical	Patients	Hospitals	Treatment outcomes by hospital
Geographic	Observations	Regions	Economic data by state
Longitudinal	Time points	Subjects	Repeated measurements per person

Interpretation:

No Clustering: If data is not clustered, independence may hold.
Clustering Detected: If data is clustered, independence may be violated, and you should consider alternative modeling approaches (e.g., mixed-effects models, clustered standard errors).

Summary of Independence Diagnostics¶

Method	Type	Detects	Best For
Durbin-Watson	Formal test	First-order autocorrelation	Time-series data
Residual vs. time plot	Visual	Any temporal pattern	Time-series, ordered data
Breusch-Godfrey	Formal test	Higher-order autocorrelation	Complex time dependence
Data structure review	Conceptual	Clustering, hierarchical dependence	Cross-sectional clustered data

Ensuring independence in linear regression is crucial for valid statistical inference and reliable predictions. By employing tests like the Durbin-Watson test, Breusch-Godfrey test, and examining residual plots, you can diagnose potential violations of the independence assumption. In cases where independence is violated, it is important to address the issue through appropriate modeling techniques, such as adding lagged variables, using generalized least squares, or employing mixed-effects models for clustered data.