I originally posted the benchmarks below with the purpose of recommending numpy.corrcoef
, foolishly not realizing that the original question already uses corrcoef
and was in fact asking about higher order polynomial fits. I've added an actual solution to the polynomial r-squared question using statsmodels, and I've left the original benchmarks, which while off-topic, are potentially useful to someone.
statsmodels
has the capability to calculate the r^2
of a
polynomial fit directly, here are 2 methods...
import statsmodels.api as sm
import statsmodels.formula.api as smf
# Construct the columns for the different powers of x
def get_r2_statsmodels[x, y, k=1]:
xpoly = np.column_stack[[x**i for i in range[k+1]]]
return sm.OLS[y, xpoly].fit[].rsquared
# Use the formula API and construct a formula describing the polynomial
def get_r2_statsmodels_formula[x, y, k=1]:
formula = 'y ~ 1 + ' + ' + '.join['I[x**{}]'.format[i] for i in range[1, k+1]]
data = {'x': x, 'y': y}
return smf.ols[formula, data].fit[].rsquared # or rsquared_adj
To further take advantage of statsmodels
, one should also look at the fitted model summary, which can be printed or displayed as a rich HTML table in Jupyter/IPython notebook. The results object provides access to many useful statistical metrics in addition to rsquared
.
model = sm.OLS[y, xpoly]
results = model.fit[]
results.summary[]
Below is my original Answer where I benchmarked various linear regression r^2 methods...
The
corrcoef function used in the Question calculates the correlation coefficient, r
, only for a single linear regression, so it doesn't address the question of r^2
for higher order polynomial fits. However, for what it's worth, I've come to find that for linear regression, it is indeed the fastest and most direct method of calculating r
.
def get_r2_numpy_corrcoef[x, y]:
return np.corrcoef[x, y][0, 1]**2
These were my timeit results from comparing a bunch of methods for 1000 random [x, y] points:
- Pure Python [direct
r
calculation]- 1000 loops, best of 3: 1.59 ms per loop
- Numpy polyfit [applicable to n-th degree polynomial fits]
- 1000 loops, best of 3: 326 µs per loop
- Numpy Manual [direct
r
calculation]- 10000 loops, best of 3: 62.1 µs per loop
- Numpy corrcoef [direct
r
calculation]- 10000 loops, best of 3: 56.6 µs per loop
- Scipy [linear regression with
r
as an output]- 1000 loops, best of 3: 676 µs per loop
- Statsmodels [can do n-th degree polynomial and many other fits]
- 1000 loops, best of 3: 422 µs per loop
The corrcoef method narrowly beats calculating the r^2 "manually" using numpy methods. It is >5X faster than the polyfit method and ~12X faster than the scipy.linregress. Just to reinforce what
numpy is doing for you, it's 28X faster than pure python. I'm not well-versed in things like numba and pypy, so someone else would have to fill those gaps, but I think this is plenty convincing to me that corrcoef
is the best tool for calculating r
for a simple linear regression.
Here's my benchmarking code. I copy-pasted from a Jupyter Notebook [hard not to call it an IPython Notebook...], so I apologize if anything broke on the way. The %timeit magic command requires IPython.
import numpy as np
from scipy import stats
import statsmodels.api as sm
import math
n=1000
x = np.random.rand[1000]*10
x.sort[]
y = 10 * x + [5+np.random.randn[1000]*10-5]
x_list = list[x]
y_list = list[y]
def get_r2_numpy[x, y]:
slope, intercept = np.polyfit[x, y, 1]
r_squared = 1 - [sum[[y - [slope * x + intercept]]**2] / [[len[y] - 1] * np.var[y, ddof=1]]]
return r_squared
def get_r2_scipy[x, y]:
_, _, r_value, _, _ = stats.linregress[x, y]
return r_value**2
def get_r2_statsmodels[x, y]:
return sm.OLS[y, sm.add_constant[x]].fit[].rsquared
def get_r2_python[x_list, y_list]:
n = len[x_list]
x_bar = sum[x_list]/n
y_bar = sum[y_list]/n
x_std = math.sqrt[sum[[[xi-x_bar]**2 for xi in x_list]]/[n-1]]
y_std = math.sqrt[sum[[[yi-y_bar]**2 for yi in y_list]]/[n-1]]
zx = [[xi-x_bar]/x_std for xi in x_list]
zy = [[yi-y_bar]/y_std for yi in y_list]
r = sum[zxi*zyi for zxi, zyi in zip[zx, zy]]/[n-1]
return r**2
def get_r2_numpy_manual[x, y]:
zx = [x-np.mean[x]]/np.std[x, ddof=1]
zy = [y-np.mean[y]]/np.std[y, ddof=1]
r = np.sum[zx*zy]/[len[x]-1]
return r**2
def get_r2_numpy_corrcoef[x, y]:
return np.corrcoef[x, y][0, 1]**2
print['Python']
%timeit get_r2_python[x_list, y_list]
print['Numpy polyfit']
%timeit get_r2_numpy[x, y]
print['Numpy Manual']
%timeit get_r2_numpy_manual[x, y]
print['Numpy corrcoef']
%timeit get_r2_numpy_corrcoef[x, y]
print['Scipy']
%timeit get_r2_scipy[x, y]
print['Statsmodels']
%timeit get_r2_statsmodels[x, y]
7/28/21 Benchmark results. [Python 3.7, numpy 1.19, scipy 1.6, statsmodels 0.12]
Python
2.41 ms ± 180 µs per loop [mean ± std. dev. of 7 runs, 100 loops each]
Numpy polyfit
318 µs ± 44.3 µs per loop [mean ± std. dev. of 7 runs, 1000 loops each]
Numpy Manual
79.3 µs ± 4.05 µs per loop [mean ± std. dev. of 7 runs, 10000 loops each]
Numpy corrcoef
83.8 µs ± 1.37 µs per loop [mean ± std. dev. of 7 runs, 10000 loops each]
Scipy
221 µs ± 7.12 µs per loop [mean ± std. dev. of 7 runs, 1000 loops each]
Statsmodels
375 µs ± 3.63 µs per loop [mean ± std. dev. of 7 runs, 1000 loops each]