Multivariate Adaptive Regression Splines (MARS)
How would you like a modeling technique that provides all of the following?
- Offers the flexibility to build linear and nonlinear models for both regression and classification
- Can support variable interaction terms
- Is simple to understand and explain
- Requires little data preprocessing
- Handles all types of data: numeric, factors, and so on
- Performs well on unseen data; that is, it handles the bias-variance trade-off well
If that all sounds appealing, then I cannot recommend the use of MARS models enough. The method was brought to my attention several months ago, and I have found it to perform extremely well. In fact, in a recent case of mine, it outperformed both a random forest and boosted trees on test/validation data. It has quickly become my baseline model and all others are competitors. The other benefit I've seen is that it has negated much of the feature engineering I was doing. Much of that was using Weight-of-Evidence (WOE) and Information Values (IV) to capture nonlinearity and recode the variables. This WOE/IV technique was something I was planning to write about at length in this second edition. However, I've done a number of tests, and MARS does an excellent job of doing what that technique does (that is, capture non-linearity), so I will not discuss WOE/IV at all.
MARS is quite simple to understand. First, start with a linear or generalized linear model like the ones we discussed previously. Then, to capture any nonlinear relationship, hinge functions are added. A hinge is simply a point (knot) in an input feature at which the coefficient changes. For example, say we have Y = 12.5 (our intercept) + 1.5(variable 1) + 3.3(variable 2), where variables 1 and 2 are on a scale of 1 to 10. Now, let's see how a hinge function for variable 2 could come into play:
Y = 11 (new intercept) + 1.5(variable 1) + 4.26734(max(0, variable 2 - 5.5))
Thus, we read the hinge function as: take the maximum of either 0 or variable 2 minus 5.5. So, whenever variable 2 has a value greater than 5.5, the amount by which it exceeds 5.5 is multiplied by the coefficient; otherwise, the term is zero. The method will accommodate multiple hinges for each variable and also interaction terms.
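To make the hinge concrete, here is a minimal R sketch that evaluates the preceding equation for one hypothetical observation; the values of variable 1 and variable 2 are made up purely for illustration and are not taken from any fitted model:
> hinge <- function(x, knot) pmax(0, x - knot)  # the hinge: max(0, x - knot)
> variable1 <- 7
> variable2 <- 8  # hypothetical inputs on the 1 to 10 scale
> 11 + 1.5 * variable1 + 4.26734 * hinge(variable2, 5.5)
[1] 32.16835
Because variable 2 exceeds the knot at 5.5, the hinge contributes 4.26734 * 2.5 to the prediction; had variable 2 been 5.5 or less, that term would have been zero.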
The other interesting thing about MARS is automatic variable selection. This can be done with cross-validation, but the default is to build through a forward pass, much like forward stepwise regression, followed by a backward pass to prune the model, since after the forward pass the model is likely to overfit the data. The backward pass prunes input features and removes hinges based on Generalized Cross Validation (GCV):
GCV = RSS / (N * (1 - Effective Number of Parameters / N)^2)
Effective Number of Parameters = Number of Input Features + Penalty * (Number of Input Features - 1) / 2
In the earth package, Penalty = 2 for an additive model and 3 for a multiplicative model, that is, one with interaction terms.
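For a sense of how these formulas work together, here is a minimal R sketch of the GCV calculation using made-up values for RSS, N, the number of input features, and the penalty; it simply codes the two formulas above and is not taken from the earth package internals:
> rss <- 100; n <- 500; n_features <- 8; penalty <- 2  # hypothetical values; penalty = 2 for an additive model
> enp <- n_features + penalty * (n_features - 1) / 2   # effective number of parameters
> rss / (n * (1 - enp / n)^2)                          # GCV
[1] 0.2125624
Note that adding features or hinges inflates the effective number of parameters, so GCV penalizes more complex models even when RSS decreases.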
In R, there are quite a few parameters you can tune. In the example, I will demonstrate a simple and effective way to implement the methodology. If you so desire, you can learn more about its flexibility in the excellent online Notes on the earth package by Stephen Milborrow, available at this link:
http://www.milbo.org/doc/earth-notes.pdf
With that introduction out of the way, let's get started. You can use the mda package, but I learned on earth, so that is what I will present. The code is similar to the earlier examples where we used glm(). However, it is important to specify how you want the model pruned and that it is a binomial response variable. Here, I specify model selection by five-fold cross-validation (pmethod = "cv" and nfold = 5), repeated 3 times (ncross = 3), an additive model with no interactions (degree = 1), and only one hinge per input feature (minspan = -1). In the data I've been working with, both interaction terms and multiple hinges have led to overfitting. Of course, your results may vary. The code is as follows:
> library(earth)
> set.seed(1)
> earth.fit <- earth(class ~ ., data = train,
                     pmethod = "cv",
                     nfold = 5,
                     ncross = 3,
                     degree = 1,
                     minspan = -1,
                     glm = list(family = binomial))
We now examine the model summary. At first, it can seem a little confusing. Of course, we have the model formula and the logistic coefficients, including the hinge functions, followed by some commentary and numbers related to generalized R-squared, and so on. What happens is that the MARS model is first built as a standard linear regression, with the response variable internally coded as 0s and 1s. After the features/variables are pruned and the final model is selected, it is refit as a GLM. So, just ignore the R-squared values:
> summary(earth.fit)
Call: earth(formula=class~., data=train, pmethod="cv",
            glm=list(family=binomial), degree=1, ncross=3,
            nfold=5, minspan=-1)
GLM coefficients
malignant
(Intercept) -6.5746417
u.size 0.1502747
adhsn 0.3058496
s.size 0.3188098
nucl 0.4426061
n.nuc 0.2307595
h(thick-3) 0.7019053
h(3-chrom) -0.6927319
Earth selected 8 of 10 terms, and 7 of 9 predictors using pmethod="cv"
Termination condition: RSq changed by less than 0.001 at 10 terms
Importance: nucl, u.size, thick, n.nuc, chrom, s.size, adhsn, u.shape-unused, ...
Number of terms at each degree of interaction: 1 7 (additive model)
Earth GRSq 0.8354593 RSq 0.8450554 mean.oof.RSq 0.8331308 (sd 0.0295)
GLM null.deviance 620.9885 (473 dof) deviance 81.90976 (466 dof) iters 8
pmethod="backward" would have selected the same model:
    8 terms 7 preds, GRSq 0.8354593 RSq 0.8450554 mean.oof.RSq 0.8331308
The model gives us eight terms, including the intercept, and seven predictors. Two of the predictors have hinge functions: thickness and chromatin. If thickness is greater than 3, the amount by which it exceeds 3 is multiplied by the coefficient of 0.7019; otherwise, that term is 0. For chromatin, if the value is less than 3, then the coefficient is multiplied by 3 minus that value; otherwise, the term is 0.
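To see how these two hinge terms enter the linear predictor, here is a small sketch for a hypothetical observation; the coefficients come from the preceding summary, the values of thick and chrom are made up, and only the contribution of the two hinge terms is shown, not a complete prediction:
> thick <- 5; chrom <- 2          # hypothetical observation
> 0.7019053 * max(0, thick - 3)   # h(thick-3) term: nonzero because thick > 3
[1] 1.403811
> -0.6927319 * max(0, 3 - chrom)  # h(3-chrom) term: nonzero because chrom < 3
[1] -0.6927319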
Plots are available. The first one, using the plotmo() function, produces plots showing the model's response as each predictor is varied while the others are held constant. You can clearly see the hinge function at work for thickness:
> plotmo(earth.fit)
The following is the output of the preceding command:
With plotd(), you can see density plots of the predicted probabilities by class label:
> plotd(earth.fit)
The following is the output of the preceding command:
We can also look at relative variable importance. In the output, nsubsets is the number of model subsets that include the variable after the pruning pass, while the gcv and rss columns show the decrease in the respective criterion that the variable contributes (gcv and rss are scaled from 0 to 100):
> evimp(earth.fit)
nsubsets gcv rss
nucl 7 100.0 100.0
u.size 6 44.2 44.8
thick 5 23.8 25.1
n.nuc 4 15.1 16.8
chrom 3 8.3 10.7
s.size 2 6.0 8.1
adhsn 1 2.3 4.6
Let's see how well it did on the test dataset:
> test.earth.probs <- predict(earth.fit, newdata = test, type = "response")
> misClassError(testY, test.earth.probs)
[1] 0.0287
> confusionMatrix(testY, test.earth.probs)
0 1
0 138 2
1 4 65
This is very comparable to our logistic regression models. We can now compare the models and see what our best choice would be.
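If you want that comparison in one place, a sketch along these lines works, assuming you kept the predicted probabilities from one of the earlier logistic regression models; test.probs below is only a placeholder name and should be replaced with whatever object you created earlier:
> # side-by-side test-set error rates; test.probs is a placeholder object
> data.frame(model = c("logistic", "MARS"),
    error = c(misClassError(testY, test.probs),
              misClassError(testY, test.earth.probs)))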