Function that fits and selects the number of components by cross-validation.
Source:R/scglrCrossVal.r
scglrCrossVal.Rd
Arguments
- formula
an object of class "Formula" (or one that can be coerced to that class): a symbolic description of the model to be fitted.
- data
the data frame to be modeled.
- family
a character vector of length q specifying the distribution of each response. "bernoulli", "binomial", "poisson" and "gaussian" are allowed.
- K
number of components, default is one.
- folds
number of folds, default is 10. Although folds can be as large as the sample size (leave-one-out CV), it is not recommended for large datasets. folds can also be provided as a vector (same length as data) of fold identifiers.
- type
loss function to use for cross-validation. Six options are currently available, depending on whether the responses are of the same distribution family. If the responses are all Bernoulli distributed, prediction performance may be measured through the area under the ROC curve: type = "auc". In any other case, choose one of the following five options: "likelihood", "aic", "aicc", "bic", "mspe".
- size
specifies the number of trials of the binomial variables included in the model. An (n*qb) matrix is expected for qb binomial variables.
- offset
used for the poisson dependent variables. A vector or a matrix of size: number of observations * number of Poisson dependent variables is expected.
- na.action
a function which indicates what should happen when the data contain NAs. The default is set to na.omit.
- crit
a list of two elements, maxit and tol, giving respectively the maximum number of iterations and the convergence tolerance for the Fisher scoring algorithm. Defaults are 50 and 10e-6 respectively.
- method
Regularization criterion type. Object of class "method.SCGLR" built by methodSR for Structural Relevance.
- nfolds
deprecated. Use the folds parameter instead.
- mc.cores
deprecated.
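The folds argument accepts a vector of fold identifiers, one per row of data. A minimal sketch of how such a vector might be built in base R follows; the helper name make_fold_ids and its arguments are hypothetical and not part of SCGLR.

```r
# Hypothetical helper (not part of SCGLR): one fold identifier per observation.
make_fold_ids <- function(n, k, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  # rep_len() cycles 1..k up to length n, so fold sizes differ by at most one;
  # sample() then shuffles the assignment across observations.
  sample(rep_len(seq_len(k), n))
}

fold_ids <- make_fold_ids(n = 25, k = 10, seed = 1)
table(fold_ids)  # number of observations assigned to each fold
```

With n set to the number of rows in data, the resulting vector could then be supplied as folds = fold_ids, which keeps the same partition across repeated runs.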
Value
a matrix containing the criterion values for each response (rows) and each number of components (columns).
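One possible way to read such a matrix is sketched below on a toy example with made-up values. It assumes a loss-type criterion for which smaller is better and strictly positive values (so that the logarithm is defined). The column labels K0 to K3 are an assumption for illustration only; the labelling of the real result should be checked with colnames().

```r
# Toy criterion matrix with made-up values: 2 responses (rows) and
# 4 candidate numbers of components (columns). Column labels are assumed.
cv <- matrix(c(4.0, 3.0, 2.5, 2.8,
               9.0, 6.0, 5.5, 5.9),
             nrow = 2, byrow = TRUE,
             dimnames = list(c("y1", "y2"), paste0("K", 0:3)))

# Average the log-criterion over responses so that a response with a
# large scale does not dominate, then take the column with the smallest mean.
mean_crit <- colMeans(log(cv))
best <- names(which.min(mean_crit))
best
```

Selecting by column name rather than by position avoids an off-by-one error if the first column corresponds to a model without components. For a criterion where larger is better, such as "auc", which.max would be the natural counterpart.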
References
Bry X., Trottier C., Verron T. and Mortier F. (2013) Supervised Component Generalized Linear Regression using a PLS-extension of the Fisher scoring algorithm. Journal of Multivariate Analysis, 119, 47-60.
Examples
if (FALSE) { # \dontrun{
library(SCGLR)
# load sample data
data(genus)
# get variable names from dataset
n <- names(genus)
ny <- n[grep("^gen",n)] # Y <- names that begin with "gen"
nx <- n[-grep("^gen",n)] # X <- remaining names
# remove "geology" and "surface" from nx
# as surface is the offset and geology is used as an additional covariate
nx <- nx[!nx %in% c("geology","surface")]
# build multivariate formula
# we also add "lat*lon" as computed covariate
form <- multivariateFormula(ny,c(nx,"I(lat*lon)"),A=c("geology"))
# define family
fam <- rep("poisson",length(ny))
# cross validation
genus.cv <- scglrCrossVal(formula=form, data=genus, family=fam, K=12,
                          offset=genus$surface)
# find best K
mean.crit <- colMeans(log(genus.cv))
#plot(mean.crit, type="l")
} # }