# Is R-squared Useless?

On Thursday, October 16, 2015, a disbelieving student posted on Reddit: “My stats professor just went on a rant about how R-squared values are essentially useless, is there any truth to this?” It attracted a fair amount of attention, at least compared to other posts about statistics on Reddit.

It turns out the student’s stats professor was Cosma Shalizi of Carnegie Mellon University. Shalizi provides free and open access to his class lecture materials so we can see what exactly he was “ranting” about. It all begins in Section 3.2 of his Lecture 10 notes.


In case you forgot or didn’t know, R-squared is a statistic that often accompanies regression output. It ranges in value from 0 to 1 and is usually interpreted as summarizing the percent of variation in the response that the regression model explains. So an R-squared of 0.65 might mean that the model explains about 65% of the variation in our dependent variable. Given this logic, we prefer our regression models have a high R-squared. Shalizi, however, disputes this logic with convincing arguments.

In R, we typically get R-squared by calling the summary function on a model object. Here’s a quick example using simulated data:
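A minimal sketch of such an example. The sample size, coefficients, and noise level are illustrative assumptions:

```r
set.seed(1)                            # fix the seed so the run is reproducible
x <- 1:20                              # predictor (assumed values)
y <- 2 + 0.5 * x + rnorm(20, 0, 3)     # response: linear in x plus Normal noise
mod <- lm(y ~ x)                       # simple linear regression
r2 <- summary(mod)$r.squared           # extract just the R-squared value
r2
```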

One way to express R-squared is as the sum of squared fitted-value deviations divided by the sum of squared original-value deviations:

$$R^{2} = \frac{\sum (\hat{y} - \bar{\hat{y}})^{2}}{\sum (y - \bar{y})^{2}}$$

We can calculate it directly using our model object like so:
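A sketch of that calculation, assuming simulated data with an illustrative slope of 0.5 and noise standard deviation of 3. The names `r2_manual` and `r2_summary` are introduced here only for the comparison:

```r
set.seed(1)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20, 0, 3)
mod <- lm(y ~ x)

f <- fitted(mod)                                           # fitted values, y-hat
r2_manual <- sum((f - mean(f))^2) / sum((y - mean(y))^2)   # ratio of the two sums of squares
r2_summary <- summary(mod)$r.squared                       # value reported by summary()
all.equal(r2_manual, r2_summary)                           # check that the two calculations agree
```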

Now let’s take a look at a few of Shalizi’s statements about R-squared and demonstrate them with simulations in R.

1. R-squared does not measure goodness of fit. It can be arbitrarily low when the model is completely correct. By making \(\sigma^{2}\) large, we drive R-squared towards 0, even when every assumption of the simple linear regression model is correct in every particular.

What is \(\sigma^{2}\)? When we perform linear regression, we assume our model almost predicts our dependent variable. The difference between “almost” and “exact” is assumed to be a draw from a Normal distribution with mean 0 and some variance we call \(\sigma^{2}\).

Shalizi’s statement is easy enough to demonstrate. The way we do it here is to create a function that (1) generates data meeting the assumptions of simple linear regression (independent observations, normally distributed errors with constant variance), (2) fits a simple linear model to the data, and (3) reports the R-squared. For the sake of simplicity, the function’s only parameter is `sigma`. We then “apply” this function to a series of increasing \(\sigma\) values and plot the results.
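One way to sketch this in R. The function name `r2_for_sigma`, the sample size, the coefficients, and the grid of sigma values are assumptions made for illustration:

```r
# Simulate data from a correctly specified linear model and return R-squared.
r2_for_sigma <- function(sigma) {
  x <- seq(1, 10, length.out = 100)              # predictor
  y <- 2 + 1.2 * x + rnorm(100, 0, sd = sigma)   # response: linear in x plus Normal noise
  summary(lm(y ~ x))$r.squared                   # return the R-squared value
}

set.seed(1)
sigmas <- seq(0.5, 20, length.out = 20)          # increasing noise standard deviations
r2_values <- sapply(sigmas, r2_for_sigma)        # one R-squared per sigma
plot(r2_values ~ sigmas, type = "b",
     xlab = "sigma", ylab = "R-squared")
```

Every data set here comes from exactly the model being fitted, so any change in R-squared across the plot is due to the noise level alone.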

Sure enough, R-squared tanks hard with increasing sigma, even though the model is completely correct in every respect.

2. R-squared can be arbitrarily close to 1 when the model is totally wrong.

Again, the point being made is that R-squared does not measure goodness of fit. Here we use code from a different section of Shalizi’s lecture 10 notes to generate non-linear data. Now check the R-squared:
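A sketch along those lines. The distributions and constants below are assumptions, not necessarily a verbatim copy of the lecture notes:

```r
set.seed(1)
x <- rexp(50, rate = 0.005)                       # predictor drawn from an exponential distribution
y <- (x - 1)^2 * runif(50, min = 0.8, max = 1.2)  # response is quadratic in x, so a straight line is the wrong model
plot(x, y)                                        # the curvature is visible in the scatterplot
r2 <- summary(lm(y ~ x))$r.squared                # R-squared of the (misspecified) linear fit
r2
```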

It’s very high at about 0.85, but the model is completely wrong. Using R-squared to justify the “goodness” of our model in this instance would be a mistake. Hopefully one would plot the data first and recognize that a simple linear regression in this case would be inappropriate.

3. R-squared says nothing about prediction error, even with \(\sigma^{2}\) exactly the same, and no change in the coefficients. R-squared can be anywhere between 0 and 1 just by changing the range of X. We’re better off using Mean Square Error (MSE) as a measure of prediction error.

MSE is basically the fitted y values minus the observed y values, squared, then summed, and then divided by the number of observations.
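Written as a formula in the same notation as the R-squared expression, where \(n\) (a symbol introduced here) is the number of observations:

$$\text{MSE} = \frac{\sum (\hat{y} - y)^{2}}{n}$$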

Let’s demonstrate this statement by first generating data that meets all simple linear regression assumptions and then regressing y on x to assess both R-squared and MSE.
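A minimal sketch, assuming x runs from 1 to 10, a true slope of 1.2, and a noise standard deviation of 0.9 (all illustrative choices):

```r
set.seed(1)
x <- seq(1, 10, length.out = 100)                # predictor spans 1 to 10
y <- 2 + 1.2 * x + rnorm(100, 0, sd = 0.9)       # linear signal plus Normal noise
mod1 <- lm(y ~ x)
r2_wide <- summary(mod1)$r.squared               # R-squared
mse_wide <- sum((fitted(mod1) - y)^2) / 100      # mean squared error
c(r.squared = r2_wide, mse = mse_wide)
```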

Now repeat the above code, but this time with a different range of x. Leave everything else the same:
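A sketch of the comparison, using a hypothetical helper `fit_over_range` so that both ranges can be fitted side by side. The ranges, coefficients, and noise level are assumptions:

```r
# Hypothetical helper: fit the same model over a chosen range of x.
fit_over_range <- function(lo, hi) {
  set.seed(1)                                    # same seed, so the noise draws are identical
  x <- seq(lo, hi, length.out = 100)
  y <- 2 + 1.2 * x + rnorm(100, 0, sd = 0.9)     # same coefficients and same sigma
  mod <- lm(y ~ x)
  c(r.squared = summary(mod)$r.squared,
    mse = sum((fitted(mod) - y)^2) / 100)
}

wide <- fit_over_range(1, 10)                    # x spans 1 to 10
narrow <- fit_over_range(1, 2)                   # x spans only 1 to 2
rbind(wide, narrow)
```

Because the seed is reset inside the helper, both fits see the same noise, so the residuals (and therefore the MSE) match while R-squared does not.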

The R-squared falls from 0.94 to 0.15 but the MSE remains the same. In other words the predictive ability is the same for both data sets, but the R-squared would lead you to believe the first example somehow had a model with more predictive power.

4. R-squared cannot be compared between a model with untransformed Y and one with transformed Y, or between different transformations of Y. R-squared can easily go down when the model assumptions are better fulfilled.

Let’s examine this by generating data that would benefit from transformation. Notice the R code below is very much like our previous efforts but now we exponentiate our y variable.
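A sketch of such data and the untransformed fit. The coefficients and the noise standard deviation of 2.5 are assumptions:

```r
set.seed(1)
x <- seq(1, 2, length.out = 100)
y <- exp(-2 - 0.09 * x + rnorm(100, 0, sd = 2.5))  # exponentiating makes the errors multiplicative
mod_raw <- lm(y ~ x)                               # fit on the untransformed response
r2_raw <- summary(mod_raw)$r.squared
r2_raw
plot(mod_raw, which = 1)                           # residuals vs. fitted plot
```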

R-squared is very low and our residuals vs. fitted plot reveals outliers and non-constant variance. A common fix for this is to log transform the data. Let’s try that and see what happens:
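A sketch of the log-transformed fit, regenerating the same kind of data under assumed coefficients and noise level:

```r
set.seed(1)
x <- seq(1, 2, length.out = 100)
y <- exp(-2 - 0.09 * x + rnorm(100, 0, sd = 2.5))
mod_log <- lm(log(y) ~ x)                          # regress log(y) on x
plot(mod_log, which = 1)                           # residuals vs. fitted plot for the transformed model
```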

The diagnostic plot looks much better. Our assumption of constant variance appears to be met. But look at the R-squared:
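A sketch that reports both values side by side, again under assumed coefficients and noise level. The names `r2_raw` and `r2_log` are introduced for illustration:

```r
set.seed(1)
x <- seq(1, 2, length.out = 100)
y <- exp(-2 - 0.09 * x + rnorm(100, 0, sd = 2.5))
r2_raw <- summary(lm(y ~ x))$r.squared             # untransformed response
r2_log <- summary(lm(log(y) ~ x))$r.squared        # log-transformed response
c(untransformed = r2_raw, log_transformed = r2_log)
```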

It’s even lower! This is an extreme case and it doesn’t always happen like this. In fact, a log transformation will usually produce an increase in R-squared. But as just demonstrated, assumptions that are better fulfilled don’t always lead to higher R-squared. And hence R-squared cannot be compared between models.

5. It is very common to say that R-squared is “the fraction of variance explained” by the regression. [Yet] if we regressed X on Y, we’d get exactly the same R-squared. This in itself should be enough to show that a high R-squared says nothing about explaining one variable by another.

This is the easiest statement to demonstrate:
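A minimal sketch, assuming simulated data with an illustrative slope of 1.2 and noise standard deviation of 2:

```r
set.seed(1)
x <- seq(1, 10, length.out = 100)
y <- 2 + 1.2 * x + rnorm(100, 0, sd = 2)
r2_yx <- summary(lm(y ~ x))$r.squared   # regress y on x
r2_xy <- summary(lm(x ~ y))$r.squared   # regress x on y
all.equal(r2_yx, r2_xy)                 # compare the two R-squared values
```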

Does x explain y, or does y explain x? Are we saying “explain” to dance around the word “cause”? In a simple scenario with two variables such as this, R-squared is simply the square of the correlation between x and y:
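This can be checked with a short sketch, again on simulated data with assumed coefficients:

```r
set.seed(1)
x <- seq(1, 10, length.out = 100)
y <- 2 + 1.2 * x + rnorm(100, 0, sd = 2)
r2 <- summary(lm(y ~ x))$r.squared      # R-squared from the regression
r_sq <- cor(x, y)^2                     # squared Pearson correlation
all.equal(r_sq, r2)                     # compare the two quantities
```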

Why not just use correlation instead of R-squared in this case? But then again correlation summarizes linear relationships, which may not be appropriate for the data. This is another instance where plotting your data is strongly advised.

Let’s recap:

• R-squared does not measure goodness of fit.
• R-squared does not measure predictive error.
• R-squared does not allow you to compare models using transformed responses.
• R-squared does not measure how one variable explains another.

And that’s just what we covered in this article. Shalizi gives even more reasons in his lecture notes. And it should be noted that Adjusted R-squared does nothing to address any of these issues.

So is there any reason at all to use R-squared? Shalizi says no. (“I have never found a situation where it helped at all.”) No doubt, some statisticians and Redditors might disagree. Whatever your view, if you choose to use R-squared to inform your data analysis, it would be wise to double-check that it’s telling you what you think it’s telling you.

For questions or clarifications regarding this article, contact the UVA Library StatLab: [email protected]

View the entire collection of UVA Library StatLab articles.

Clay Ford


Statistical Research Consultant
University of Virginia Library
Oct 17, 2015