
Function that fits the model and selects the number of components by cross-validation.

Usage

scglrCrossVal(
  formula,
  data,
  family,
  K = 1,
  folds = 10,
  type = "mspe",
  size = NULL,
  offset = NULL,
  na.action = na.omit,
  crit = list(),
  method = methodSR(),
  nfolds,
  mc.cores
)

Arguments

formula

an object of class "Formula" (or one that can be coerced to that class): a symbolic description of the model to be fitted.

data

the data frame to be modeled.

family

a character vector of length q specifying the distributions of the responses. bernoulli, binomial, poisson and gaussian are allowed.

K

number of components, default is one.

folds

number of folds, default is 10. Although folds can be as large as the sample size (leave-one-out CV), this is not recommended for large datasets. folds can also be provided as a vector of fold identifiers, with one entry per row of data.
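As an illustration of the vector form of folds, the following base R sketch builds a balanced, shuffled vector of fold identifiers. The names n, k and fold_id are illustrative only, and n stands in for the number of rows of data; nothing here calls SCGLR itself.

```r
set.seed(42)   # reproducible fold assignment
n <- 25        # stand-in for nrow(data)
k <- 5         # desired number of folds

# repeat the labels 1..k until n values are produced, then shuffle them
fold_id <- sample(rep(seq_len(k), length.out = n))

# number of observations assigned to each fold
table(fold_id)
```

Such a vector would then be passed as folds = fold_id, assuming it is aligned with the rows of data.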

type

loss function to use for cross-validation. Six options are currently available, depending on whether the responses share the same distribution family. If the responses are all bernoulli distributed, prediction performance may be measured through the area under the ROC curve: type = "auc". In any other case, one can choose among the following five options: "likelihood", "aic", "aicc", "bic", "mspe".
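The rule above can be sketched as a small helper that returns "auc" only when every response is bernoulli and otherwise falls back to one of the five general options. choose_cv_type is a hypothetical function written for this illustration, not part of SCGLR, and "auc" remains optional even when it is eligible.

```r
# Hypothetical helper (not part of SCGLR): pick a value for `type`
# from the family vector, following the rule described above.
choose_cv_type <- function(family, default = "mspe") {
  general <- c("likelihood", "aic", "aicc", "bic", "mspe")
  default <- match.arg(default, general)
  if (all(family == "bernoulli")) "auc" else default
}

choose_cv_type(rep("bernoulli", 3))          # all bernoulli responses
choose_cv_type(c("poisson", "bernoulli"))    # mixed families
```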

size

specifies the number of trials of the binomial variables included in the model. A (n*qb) matrix is expected for qb binomial variables.

offset

used for the poisson dependent variables. A vector, or a matrix of size (number of observations * number of poisson dependent variables), is expected.

na.action

a function which indicates what should happen when the data contain NAs. The default is na.omit.

crit

a list of two elements, maxit and tol, giving respectively the maximum number of iterations and the convergence tolerance for the Fisher scoring algorithm. Defaults are 50 and 10e-6 respectively.

method

Regularization criterion type. Object of class "method.SCGLR" built by methodSR for Structural Relevance.

nfolds

deprecated. Use the folds parameter instead.

mc.cores

deprecated

Value

a matrix containing the criterion values for each response (rows) and each number of components (columns).

References

Bry X., Trottier C., Verron T. and Mortier F. (2013) Supervised Component Generalized Linear Regression using a PLS-extension of the Fisher scoring algorithm. Journal of Multivariate Analysis, 119, 47-60.

Examples

if (FALSE) {
library(SCGLR)

# load sample data
data(genus)

# get variable names from dataset
n <- names(genus)
ny <- n[grep("^gen", n)]    # Y <- names that begin with "gen"
nx <- n[-grep("^gen", n)]   # X <- remaining names

# remove "geology" and "surface" from nx
# as surface is the offset and we want to use geology as an additional covariate
nx <- nx[!nx %in% c("geology", "surface")]

# build multivariate formula
# we also add "lat*lon" as computed covariate
form <- multivariateFormula(ny,c(nx,"I(lat*lon)"),A=c("geology"))

# define family
fam <- rep("poisson",length(ny))

# cross validation
genus.cv <- scglrCrossVal(formula=form, data=genus, family=fam, K=12,
 offset=genus$surface)

# find best K
mean.crit <- colMeans(log(genus.cv))

#plot(mean.crit, type="l")
}
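The "find best K" step can be sketched on a small made-up criterion matrix with the same layout as the returned value (responses in rows, numbers of components in columns). The values and the column labels col1 to col4 are invented for illustration, and the sketch assumes a criterion for which lower is better, such as "mspe". Which number of components each column corresponds to is not stated above, so inspect the column names of the actual result before translating the selected column into K.

```r
# Made-up criterion values: 2 responses (rows) x 4 candidate columns
cv <- matrix(
  c(4.0, 2.0, 1.5, 1.8,
    9.0, 5.0, 3.0, 3.5),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("gen1", "gen2"), paste0("col", 1:4))
)

# average the log-criterion over responses, as in the example above
mean_crit <- colMeans(log(cv))

# column with the smallest average criterion
best_col <- which.min(mean_crit)
names(best_col)
```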