Simple Linear Regression with linregress¶
In many scientific and engineering applications, we observe two variables and want to
quantify how one depends on the other. Simple linear regression provides the most
fundamental approach: fit a straight line through the data and measure how well that
line explains the observed variation. SciPy's stats.linregress offers a fast,
one-call interface for this task, returning the fitted parameters along with key
inferential statistics.
Mental Model
linregress is SciPy's one-liner for fitting \(y = mx + b\). It returns slope, intercept, \(r\)-value, p-value, and standard error in a single call. Think of it as the simplest possible regression -- one predictor, one response, one straight line -- with built-in significance testing.
The Linear Model¶
Simple linear regression assumes that the relationship between a predictor \(x\) and a response \(y\) follows
where \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and \(\varepsilon_i\) are independent error terms with \(\mathbb{E}[\varepsilon_i] = 0\) and \(\operatorname{Var}(\varepsilon_i) = \sigma^2\).
The goal is to estimate \(\beta_0\) and \(\beta_1\) from observed data \(\{(x_i, y_i)\}_{i=1}^n\) so that the fitted line \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\) best approximates the data in the least-squares sense.
OLS Estimators¶
Ordinary least squares (OLS) minimizes the sum of squared residuals
Setting partial derivatives to zero yields the closed-form solutions
where \(\bar{x}\) and \(\bar{y}\) are the sample means of \(x\) and \(y\) respectively.
Coefficient of Determination¶
The coefficient of determination \(R^2\) measures the proportion of variance in \(y\) that the linear model explains. It is defined as
An \(R^2\) value near 1 indicates that the line captures most of the variability, while a value near 0 indicates a poor fit. In simple linear regression, \(R^2 = r^2\) where \(r\) is the Pearson correlation coefficient.
Hypothesis Test for the Slope¶
To test whether the predictor \(x\) has a statistically significant linear relationship with \(y\), we test
The test statistic is
where the standard error of the slope is
and \(\hat{\sigma}^2 = \frac{1}{n-2}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2\) is the residual variance estimator. Under \(H_0\), \(t\) follows a \(t\)-distribution with \(n - 2\) degrees of freedom.
Using scipy.stats.linregress¶
The linregress function accepts two arrays and returns five values packed into
a named result object.
```python from scipy import stats
Example: hours studied vs exam score¶
hours = [1, 2, 3, 4, 5, 6, 7, 8] scores = [52, 58, 65, 68, 73, 79, 84, 90]
result = stats.linregress(hours, scores) ```
The returned LinregressResult contains:
| Attribute | Description |
|---|---|
slope |
Estimated slope \(\hat{\beta}_1\) |
intercept |
Estimated intercept \(\hat{\beta}_0\) |
rvalue |
Pearson correlation coefficient \(r\) |
pvalue |
Two-sided p-value for testing \(H_0: \beta_1 = 0\) |
stderr |
Standard error of the slope \(\text{SE}(\hat{\beta}_1)\) |
Accessing intercept standard error
Since SciPy 1.7, result.intercept_stderr provides the standard error
of the intercept estimate \(\text{SE}(\hat{\beta}_0)\).
Interpreting the Output¶
python
print(f"Slope: {result.slope:.4f}")
print(f"Intercept: {result.intercept:.4f}")
print(f"R-squared: {result.rvalue**2:.4f}")
print(f"p-value: {result.pvalue:.4e}")
print(f"Std error: {result.stderr:.4f}")
The slope \(\hat{\beta}_1\) estimates the average change in \(y\) per unit increase in \(x\). The p-value tests whether this slope differs significantly from zero. A small p-value (typically below 0.05) leads to rejecting \(H_0\) and concluding that the linear relationship is statistically significant.
The \(R^2\) value (obtained as result.rvalue**2) indicates how much of the
variation in the response is captured by the linear fit.
Prediction¶
Once the model is fitted, predictions for new values of \(x\) are computed as
```python import numpy as np
x_new = np.array([9, 10]) y_pred = result.intercept + result.slope * x_new ```
Extrapolation risk
Predictions outside the range of the observed \(x\) values rely on the assumption that the linear relationship continues to hold. Extrapolation can produce misleading results if the true relationship is nonlinear beyond the observed range.
Assumptions and Limitations¶
Simple linear regression via linregress assumes:
- Linearity -- the true relationship between \(x\) and \(y\) is linear
- Independence -- observations are independent of each other
- Homoscedasticity -- the error variance \(\sigma^2\) is constant across all \(x\)
- Normality -- the errors \(\varepsilon_i\) are normally distributed (required for the \(t\)-test and p-value to be exact)
When these assumptions are violated, consider transformations, robust regression, or nonparametric alternatives. Residual analysis provides diagnostic tools for checking these assumptions.
Summary¶
scipy.stats.linregress fits a simple linear regression model by ordinary least
squares and returns the slope, intercept, Pearson correlation, p-value, and
standard error in a single function call. The slope estimator and its p-value
enable formal testing of linear dependence, while \(R^2\) quantifies the
explanatory power of the fitted line.
Runnable Example: seaborn_regression_plots.py¶
```python """ Tutorial 05: Regression Plots Learn linear regression visualization, regplot, lmplot, residplot Level: Intermediate """ import seaborn as sns import matplotlib.pyplot as plt import pandas as pd import numpy as np
=============================================================================¶
Main¶
=============================================================================¶
if name == "main":
sns.set_style("whitegrid")
tips = sns.load_dataset('tips')
# regplot - simple regression
plt.figure(figsize=(10, 6))
sns.regplot(data=tips, x='total_bill', y='tip')
plt.title('Linear Regression Plot', fontsize=14, fontweight='bold')
plt.show()
# lmplot - figure-level with faceting
g = sns.lmplot(data=tips, x='total_bill', y='tip', hue='time', height=5, aspect=1.5)
plt.show()
# residplot - check model assumptions
plt.figure(figsize=(10, 6))
sns.residplot(data=tips, x='total_bill', y='tip')
plt.title('Residual Plot', fontsize=14, fontweight='bold')
plt.show()
print("Tutorial 05 demonstrates regression visualizations")
print("Key functions: regplot(), lmplot(), residplot()")
```
Exercises¶
Exercise 1.
Use stats.linregress to fit a line to \(x = [1, 2, 3, 4, 5, 6]\) and \(y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]\). Print the slope, intercept, \(R^2\), and p-value. Predict the value at \(x = 10\).
Solution to Exercise 1
import numpy as np
from scipy import stats
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
result = stats.linregress(x, y)
print(f"Slope: {result.slope:.4f}, Intercept: {result.intercept:.4f}")
print(f"R-squared: {result.rvalue**2:.4f}, p-value: {result.pvalue:.4e}")
print(f"Prediction at x=10: {result.intercept + result.slope * 10:.4f}")
Exercise 2.
Generate 100 samples from \(Y = 3 + 2X + \varepsilon\) where \(X \sim U(0, 10)\) and \(\varepsilon \sim N(0, 2)\). Fit using linregress, compare estimated slope and intercept to the true values, and compute the 95% confidence interval for the slope.
Solution to Exercise 2
import numpy as np
from scipy import stats
np.random.seed(42)
x = np.random.uniform(0, 10, 100)
y = 3 + 2 * x + np.random.normal(0, 2, 100)
result = stats.linregress(x, y)
t_crit = stats.t.ppf(0.975, df=98)
ci_low = result.slope - t_crit * result.stderr
ci_high = result.slope + t_crit * result.stderr
print(f"Slope: {result.slope:.4f} (true: 2), Intercept: {result.intercept:.4f} (true: 3)")
print(f"95% CI for slope: [{ci_low:.4f}, {ci_high:.4f}]")
Exercise 3. Fit a linear regression to data that has a clear quadratic relationship: \(x = [1, ..., 20]\), \(y = x^2\). Examine the \(R^2\) value and residual pattern. Explain why the linear fit is inadequate.
Solution to Exercise 3
import numpy as np
from scipy import stats
x = np.arange(1, 21, dtype=float)
y = x ** 2
result = stats.linregress(x, y)
residuals = y - (result.intercept + result.slope * x)
print(f"R-squared: {result.rvalue**2:.4f}")
print(f"Residual range: [{residuals.min():.1f}, {residuals.max():.1f}]")
print("The parabolic residual pattern shows the linear model is inadequate")