Hướng dẫn calculate r-squared python

I originally posted the benchmarks below with the purpose of recommending numpy.corrcoef, foolishly not realizing that the original question already uses corrcoef and was in fact asking about higher order polynomial fits. I've added an actual solution to the polynomial r-squared question using statsmodels, and I've left the original benchmarks, which while off-topic, are potentially useful to someone.


statsmodels has the capability to calculate the r^2 of a polynomial fit directly, here are 2 methods...

import statsmodels.api as sm
import statsmodels.formula.api as smf

# Construct the columns for the different powers of x
def get_r2_statsmodels(x, y, k=1):
    xpoly = np.column_stack([x**i for i in range(k+1)])    
    return sm.OLS(y, xpoly).fit().rsquared

# Use the formula API and construct a formula describing the polynomial
def get_r2_statsmodels_formula(x, y, k=1):
    formula = 'y ~ 1 + ' + ' + '.join('I(x**{})'.format(i) for i in range(1, k+1))
    data = {'x': x, 'y': y}
    return smf.ols(formula, data).fit().rsquared # or rsquared_adj

To further take advantage of statsmodels, one should also look at the fitted model summary, which can be printed or displayed as a rich HTML table in Jupyter/IPython notebook. The results object provides access to many useful statistical metrics in addition to rsquared.

model = sm.OLS(y, xpoly)
results = model.fit()
results.summary()

Below is my original Answer where I benchmarked various linear regression r^2 methods...

The corrcoef function used in the Question calculates the correlation coefficient, r, only for a single linear regression, so it doesn't address the question of r^2 for higher order polynomial fits. However, for what it's worth, I've come to find that for linear regression, it is indeed the fastest and most direct method of calculating r.

def get_r2_numpy_corrcoef(x, y):
    return np.corrcoef(x, y)[0, 1]**2

These were my timeit results from comparing a bunch of methods for 1000 random (x, y) points:

  • Pure Python (direct r calculation)
    • 1000 loops, best of 3: 1.59 ms per loop
  • Numpy polyfit (applicable to n-th degree polynomial fits)
    • 1000 loops, best of 3: 326 µs per loop
  • Numpy Manual (direct r calculation)
    • 10000 loops, best of 3: 62.1 µs per loop
  • Numpy corrcoef (direct r calculation)
    • 10000 loops, best of 3: 56.6 µs per loop
  • Scipy (linear regression with r as an output)
    • 1000 loops, best of 3: 676 µs per loop
  • Statsmodels (can do n-th degree polynomial and many other fits)
    • 1000 loops, best of 3: 422 µs per loop

The corrcoef method narrowly beats calculating the r^2 "manually" using numpy methods. It is >5X faster than the polyfit method and ~12X faster than the scipy.linregress. Just to reinforce what numpy is doing for you, it's 28X faster than pure python. I'm not well-versed in things like numba and pypy, so someone else would have to fill those gaps, but I think this is plenty convincing to me that corrcoef is the best tool for calculating r for a simple linear regression.

Here's my benchmarking code. I copy-pasted from a Jupyter Notebook (hard not to call it an IPython Notebook...), so I apologize if anything broke on the way. The %timeit magic command requires IPython.

import numpy as np
from scipy import stats
import statsmodels.api as sm
import math

n=1000
x = np.random.rand(1000)*10
x.sort()
y = 10 * x + (5+np.random.randn(1000)*10-5)

x_list = list(x)
y_list = list(y)

def get_r2_numpy(x, y):
    slope, intercept = np.polyfit(x, y, 1)
    r_squared = 1 - (sum((y - (slope * x + intercept))**2) / ((len(y) - 1) * np.var(y, ddof=1)))
    return r_squared
    
def get_r2_scipy(x, y):
    _, _, r_value, _, _ = stats.linregress(x, y)
    return r_value**2
    
def get_r2_statsmodels(x, y):
    return sm.OLS(y, sm.add_constant(x)).fit().rsquared
    
def get_r2_python(x_list, y_list):
    n = len(x_list)
    x_bar = sum(x_list)/n
    y_bar = sum(y_list)/n
    x_std = math.sqrt(sum([(xi-x_bar)**2 for xi in x_list])/(n-1))
    y_std = math.sqrt(sum([(yi-y_bar)**2 for yi in y_list])/(n-1))
    zx = [(xi-x_bar)/x_std for xi in x_list]
    zy = [(yi-y_bar)/y_std for yi in y_list]
    r = sum(zxi*zyi for zxi, zyi in zip(zx, zy))/(n-1)
    return r**2
    
def get_r2_numpy_manual(x, y):
    zx = (x-np.mean(x))/np.std(x, ddof=1)
    zy = (y-np.mean(y))/np.std(y, ddof=1)
    r = np.sum(zx*zy)/(len(x)-1)
    return r**2
    
def get_r2_numpy_corrcoef(x, y):
    return np.corrcoef(x, y)[0, 1]**2
    
print('Python')
%timeit get_r2_python(x_list, y_list)
print('Numpy polyfit')
%timeit get_r2_numpy(x, y)
print('Numpy Manual')
%timeit get_r2_numpy_manual(x, y)
print('Numpy corrcoef')
%timeit get_r2_numpy_corrcoef(x, y)
print('Scipy')
%timeit get_r2_scipy(x, y)
print('Statsmodels')
%timeit get_r2_statsmodels(x, y)

7/28/21 Benchmark results. (Python 3.7, numpy 1.19, scipy 1.6, statsmodels 0.12)

Python
2.41 ms ± 180 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Numpy polyfit
318 µs ± 44.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Numpy Manual
79.3 µs ± 4.05 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Numpy corrcoef
83.8 µs ± 1.37 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Scipy
221 µs ± 7.12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Statsmodels
375 µs ± 3.63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)