bagging {ipred}    R Documentation

Bagging for classification, regression and survival trees.

ipredbagg.factor(y, X=NULL, nbagg=25,
                 control=rpart.control(minsplit=2, cp=0, xval=0),
                 comb=NULL, coob=FALSE, ns=length(y), keepX=TRUE, ...)
ipredbagg.numeric(y, X=NULL, nbagg=25, control=rpart.control(xval=0),
                  comb=NULL, coob=FALSE, ns=length(y), keepX=TRUE, ...)
ipredbagg.Surv(y, X=NULL, nbagg=25, control=rpart.control(xval=0),
               comb=NULL, coob=FALSE, ns=dim(y)[1], keepX=TRUE, ...)
## S3 method for class 'data.frame':
bagging(formula, data, subset, na.action=na.rpart, ...)

`y`: the response variable: either a factor vector of class labels (bagging classification trees), a vector of numerical values (bagging regression trees) or an object of class `Surv` (bagging survival trees).

`X`: a data frame of predictor variables.

`nbagg`: an integer giving the number of bootstrap replications.

`coob`: a logical indicating whether an out-of-bag estimate of the error rate (misclassification error, root mean squared error or Brier score) should be computed. See `predict.classbagg` for details.

`control`: options that control details of the `rpart` algorithm, see `rpart.control`. It is wise to set `xval = 0` in order to save computing time. Note that the default values depend on the class of `y`.

`comb`: a list of additional models for model combination, see below for some examples. Note that the argument `method` for double-bagging has been removed; `comb` is much more flexible.

`ns`: the number of samples to draw from the learning sample. By default, the usual bootstrap (n out of n with replacement) is performed. If `ns` is smaller than `length(y)`, subagging (Buehlmann and Yu, 2002), i.e. sampling `ns` out of `length(y)` without replacement, is performed; see the sketch after this list.

`keepX`: a logical indicating whether the data frame of predictors should be returned. Note that the computation of the out-of-bag estimator requires `keepX=TRUE`.

`formula`: a formula of the form `lhs ~ rhs` where `lhs` is the response variable and `rhs` a set of predictors.

`data`: an optional data frame containing the variables in the model formula.

`subset`: an optional vector specifying a subset of observations to be used.

`na.action`: a function which indicates what should happen when the data contain `NA`s. Defaults to `na.rpart`.

`...`: additional parameters passed to `ipredbagg` or `rpart`, respectively.
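A minimal sketch of subagging via `ns` (the `Glass` data from the mlbench package are only an illustrative choice):

library("ipred")
library("mlbench")
data("Glass")
# Glass has 214 observations; Type (column 10) is the factor response.
# ns = 100 < length(y): each of the 25 trees is grown on 100
# observations drawn without replacement (subagging)
mod <- ipredbagg(Glass$Type, X = Glass[, -10], nbagg = 25, ns = 100)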

Bagging for classification and regression trees was suggested by Breiman (1996a, 1998) in order to stabilise trees.

The trees in this function are computed using the implementation in the `rpart` package. The generic function `ipredbagg` implements methods for different responses. If `y` is a factor, classification trees are constructed. For numerical vectors `y`, regression trees are aggregated and if `y` is a survival object, bagging survival trees (Hothorn et al, 2004) is performed. The function `bagging` offers a formula based interface to `ipredbagg`.

`nbagg` bootstrap samples are drawn and a tree is constructed for each of them. There is no general rule on when to stop growing a tree. The size of the trees can be controlled by the `control` argument or by `prune.classbagg`. By default, classification trees are grown as large as possible, whereas regression trees and survival trees are built with the standard options of `rpart.control`. If `nbagg=1`, one single tree is computed for the whole learning sample without bootstrapping.
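As a brief sketch of the dispatch on `class(y)` and of `nbagg=1` (the `iris` data are only an illustrative choice):

library("ipred")
data("iris")
# factor response: classification trees, grown as large as possible
cls <- ipredbagg(iris$Species, X = iris[, 1:4])
# numeric response with nbagg = 1: a single regression tree is fitted
# to the whole learning sample, without bootstrapping
reg <- ipredbagg(iris$Sepal.Length, X = iris[, 2:4], nbagg = 1)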

If `coob` is TRUE, the out-of-bag sample (Breiman, 1996b) is used to estimate the prediction error corresponding to `class(y)`. Alternatively, the out-of-bag sample can be used for model combination; an out-of-bag error rate estimator is not available in this case. Double-bagging (Hothorn and Lausen, 2003) computes an LDA on the out-of-bag sample and uses the discriminant variables as additional predictors for the classification trees. `comb` is an optional list of lists with two elements `model` and `predict`. `model` is a function with arguments `formula` and `data`. `predict` is a function with arguments `object, newdata` only. If the estimation of the covariance matrix in `lda` fails due to a limited out-of-bag sample size, one can use `slda` instead. See the example section for an example of double-bagging. The methodology is not limited to a combination with LDA: bundling (Hothorn and Lausen, 2002b) can be used with arbitrary classifiers.
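As a sketch, a `comb` specification based on `slda` could look as follows; it assumes that `predict.slda`, like `predict.lda`, returns the discriminant scores in component `x`:

# each element of comb is a list with a fitting function `model` and a
# prediction function `predict`; the return value of `predict` is used
# as additional predictors for the trees
comb.slda <- list(list(model = slda,
                       predict = function(object, newdata)
                           predict(object, newdata)$x))
# usage: bagging(Class ~ ., data = Ionosphere, comb = comb.slda)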

The class of the object returned depends on `class(y)`: `classbagg`, `regbagg` or `survbagg`. Each is a list with elements

`y`: the vector of responses.

`X`: the data frame of predictors.

`mtrees`: multiple trees: a list of length `nbagg` containing the trees (and possibly additional objects) for each bootstrap sample.

`OOB`: logical whether the out-of-bag estimate should be computed.

`err`: if `OOB=TRUE`, the out-of-bag estimate of misclassification or root mean squared error or the Brier score for censored data.

`comb`: logical whether a combination of models was requested.
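For instance, a minimal sketch of inspecting these elements (using the `iris` data for illustration):

library("ipred")
data("iris")
mod <- bagging(Species ~ ., data = iris, coob = TRUE)
class(mod)          # "classbagg", since the response is a factor
length(mod$mtrees)  # nbagg, one tree per bootstrap sample
mod$err             # out-of-bag misclassification error (coob = TRUE)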

For each class, methods for the generics `prune`, `print`, `summary` and `predict` are available for inspection of the results and prediction, for example: `print.classbagg`, `summary.classbagg`, `predict.classbagg` and `prune.classbagg` for classification problems.
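Continuing the sketch above, these generics can be applied directly to the fitted `classbagg` object:

library("rpart")    # provides the prune generic
print(mod)
summary(mod)                   # prints each of the nbagg trees
pruned <- prune(mod)           # prune.classbagg: prunes every tree
head(predict(mod, newdata = iris))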

Torsten Hothorn <Torsten.Hothorn@rzmail.uni-erlangen.de>

Leo Breiman (1996a), Bagging Predictors. *Machine Learning*
**24**(2), 123–140.

Leo Breiman (1996b), Out-Of-Bag Estimation. *Technical Report*
ftp://ftp.stat.berkeley.edu/pub/users/breiman/OOBestimation.ps.Z.

Leo Breiman (1998), Arcing Classifiers. *The Annals of Statistics*
**26**(3), 801–824.

Peter Buehlmann and Bin Yu (2002), Analyzing Bagging. *The Annals of
Statistics* **30**(4), 927–961.

Torsten Hothorn and Berthold Lausen (2003), Double-Bagging: Combining
classifiers by bootstrap aggregation. *Pattern Recognition*,
**36**(6), 1303–1309.

Torsten Hothorn and Berthold Lausen (2002b), Bundling Classifiers by Bagging
Trees. *submitted*.
Preprint available from
http://www.mathpreprints.com/math/Preprint/blausen/20021016/1.

Torsten Hothorn, Berthold Lausen, Axel Benner and Martin
Radespiel-Troeger (2004), Bagging Survival Trees.
*Statistics in Medicine*, **23**(1), 77–91.

library("ipred")
library("rpart")
library("mlbench")    # BreastCancer, Ionosphere, BostonHousing, friedman1
library("MASS")       # lda for double-bagging
library("survival")   # Surv for survival trees

# Classification: Breast Cancer data
data("BreastCancer")
# Test set error bagging (nbagg = 50): 3.7% (Breiman, 1998, Table 5)
mod <- bagging(Class ~ Cl.thickness + Cell.size
                + Cell.shape + Marg.adhesion
                + Epith.c.size + Bare.nuclei
                + Bl.cromatin + Normal.nucleoli
                + Mitoses, data=BreastCancer, coob=TRUE)
print(mod)

# Test set error bagging (nbagg = 50): 7.9% (Breiman, 1996a, Table 2)
data("Ionosphere")
Ionosphere$V2 <- NULL # constant within groups
bagging(Class ~ ., data=Ionosphere, coob=TRUE)

# Double-Bagging: combine LDA and classification trees;
# predict returns the linear discriminant values, i.e. linear
# combinations of the original predictors
comb.lda <- list(list(model=lda,
                      predict=function(obj, newdata)
                          predict(obj, newdata)$x))
# Note: the out-of-bag estimator is not available in this situation,
# use errorest instead
mod <- bagging(Class ~ ., data=Ionosphere, comb=comb.lda)
predict(mod, Ionosphere[1:10,])

# Regression: Boston Housing data
data("BostonHousing")
# Test set error (nbagg = 25, trees pruned): 3.41 (Breiman, 1996a, Table 8)
mod <- bagging(medv ~ ., data=BostonHousing, coob=TRUE)
print(mod)

learn <- as.data.frame(mlbench.friedman1(200))
# Test set error (nbagg = 25, trees pruned): 2.47 (Breiman, 1996a, Table 8)
mod <- bagging(y ~ ., data=learn, coob=TRUE)
print(mod)

# Survival data: Brier score for censored data estimated by
# 10 times 10-fold cross-validation: 0.2 (Hothorn et al, 2004)
data("DLBCL")
mod <- bagging(Surv(time, cens) ~ MGEc.1 + MGEc.2 + MGEc.3 + MGEc.4
               + MGEc.5 + MGEc.6 + MGEc.7 + MGEc.8 + MGEc.9
               + MGEc.10 + IPI, data=DLBCL, coob=TRUE)
print(mod)
