Regression is often used to simplify our understanding of how one variable affects another. Insurance payouts, credit scores, and even car values are estimated using regression formulas. These are powerful statistical tools that allow us to look at hundreds, thousands, even millions of cases (for example, car sales) and derive a “line of best fit” that lets us estimate what value to assign to future cases.
So, for example, let’s say you own a Porsche and go shopping at Target to get some fancy seat covers for it. While you’re in the store, a mystery SUV totals your prized possession. How does your insurance company determine what it’s worth?
The insurance company is going to turn to one of the three major “blue book” services: Edmunds, Kelley Blue Book, and NADA. How do they derive the value? Each of these companies uses actual sales records. But how can they possibly know how much the color changes the price? Antilock brakes? Four-cylinder vs. V6? Mileage? Location of the sale?
Well, it’s fairly simple on the surface. If we take all the cars sold of a particular model and tally all their equipment in a giant spreadsheet, then we can run a kind of filter command on it. We could select all the Toyota Camry LEs with a four-cylinder engine, air conditioning, and power locks/windows, if that is what we were looking for, and take the average price for that configuration. But how do we account for the condition, the location, or the mileage?
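To make that filter-and-average idea concrete, here is a minimal sketch in Python; the sales records, field names, and prices are all invented for illustration:

```python
# A minimal sketch of the filter-and-average approach. The sales records,
# field names, and prices below are invented for illustration.
sales = [
    {"model": "Camry LE", "engine": "I4", "ac": True, "power_pkg": True, "price": 18500},
    {"model": "Camry LE", "engine": "I4", "ac": True, "power_pkg": True, "price": 17900},
    {"model": "Camry LE", "engine": "V6", "ac": True, "power_pkg": True, "price": 21200},
    {"model": "Camry LE", "engine": "I4", "ac": True, "power_pkg": False, "price": 16800},
]

# Keep only the cars matching the exact feature combination we care about.
matches = [
    s["price"]
    for s in sales
    if s["model"] == "Camry LE" and s["engine"] == "I4" and s["ac"] and s["power_pkg"]
]

average = sum(matches) / len(matches)
print(f"Average price for that configuration: ${average:,.0f}")  # $18,200
```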
This is where regression comes in. As mentioned, the statistical process of regression is used to find a “line of best fit” in the data. So if we made a graph of all the sales of that make/model/year, with price on one axis and mileage on the other, we would get a scatter plot of price against mileage. A regression line is defined as “the line of best fit,” which means that it does the best job of summarizing the values across the graph. In fact, the technical criterion behind a regression line is “least squares”: it is the line that minimizes the sum of the squared differences between itself and the data points, and that name is important for us to consider.
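Here is a small sketch of what “least sum of squared differences” means in practice, again with invented (mileage, price) pairs; the slope and intercept formulas are the standard closed-form least-squares solution:

```python
# A sketch of the least-squares "line of best fit", fit to invented
# (mileage, price) pairs. The slope and intercept below are the standard
# closed-form formulas that minimize the sum of squared differences.
def fit(points):
    """Return (slope, intercept) of the least-squares line."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x, _ in points))
    return slope, mean_y - slope * mean_x

points = [(30000, 21000), (45000, 19500), (60000, 18200),
          (75000, 16800), (90000, 15500)]
slope, intercept = fit(points)
print(f"price ≈ {intercept:,.0f} + ({slope:.4f}) * mileage")
# price ≈ 23,680 + (-0.0913) * mileage
```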
But this means there is one important assumption. For the least-squares line to fit well, we must assume that there are no values that are “way off.” As long as most of the values are in the same range, there are no problems. But extreme values cause problems, and these values have a name: outliers. Outliers have the curious property of throwing off regressions. Because the differences are squared before they are summed, a single value far out of range can pull the regression line well away from the rest of the data.
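That squaring effect is easy to demonstrate by reusing the fit() sketch above: add one invented sale far above the trend and watch both the slope and the intercept move:

```python
# One extreme sale is enough to drag the whole line. This reuses fit() and
# points from the sketch above; the outlier is invented, like everything else.
outlier = (5000, 48000)  # a near-new, 5,000-mile car sold at a huge premium

for label, data in [("without outlier", points), ("with outlier", points + [outlier])]:
    slope, intercept = fit(data)
    print(f"{label}: price ≈ {intercept:,.0f} + ({slope:.4f}) * mileage")
# without outlier: price ≈ 23,680 + (-0.0913) * mileage
# with outlier:    price ≈ 39,909 + (-0.3294) * mileage
```

One added point roughly triples the slope and moves the intercept by about $16,000; squaring gives that single sale enormous leverage over the line.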
What do statisticians do with outliers? It’s a dirty little secret. They throw them out. Outliers don’t work in regression statistics. So cars with outlier values (extreme customizations, very low mileage, brand-new condition) may not “fit” the blue book regression; they were never in the regression to begin with. What the blue book then attempts to do is estimate those extremes (the outliers) by extrapolating, that is, by extending the regression line beyond the range of data it was fit on.
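Extrapolation has a failure mode of its own, visible in the same sketch: ask the fitted line about a mileage far outside the range it was fit on, and the answer stops making sense:

```python
# Reusing fit() and points from above: the line was fit on sales between
# 30,000 and 90,000 miles, so 300,000 miles is far outside its scope.
slope, intercept = fit(points)
print(f"extrapolated price at 300,000 miles: {intercept + slope * 300000:,.0f}")
# extrapolated price at 300,000 miles: -3,720
```

A negative price is nonsense, and that is exactly the risk the blue books take whenever they extend the line past the data it came from.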
So if you’re trying to estimate what your car is worth, or if you suspect that your insurance company is way off in reimbursing you for the value of your car, look for outliers. My 1992 Toyota Hi-Lux truck has 91,000 miles on it. That’s an average of about 4,550 miles a year over 20 years. A regression formula can’t accommodate such an extreme value. This is why the blue books need to be taken with a grain of salt in these cases.
If your car has an unusual property such as very low mileage, and you are having trouble finding a fair value, you may be justified in ignoring the blue book and asking your insurance company to do the same.
And now you know the dirty little secret about regression.