Perform component selection using a backward approach based on cross-validation

Usage

scglrThemeBackward(
  formula,
  data,
  H,
  family,
  size = NULL,
  weights = NULL,
  offset = NULL,
  na.action = na.omit,
  crit = list(),
  method = methodSR(),
  folds = 10,
  type = "mspe",
  st = FALSE
)

Arguments

formula

an object of class "Formula" (or one that can be coerced to that class): a symbolic description of the model to be fitted. The details of model specification are given under Details.

data

data frame.

H

vector of R integers: the number of components to keep for each of the R themes.

family

a character vector with one entry per dependent variable. Allowed values are "bernoulli", "binomial", "poisson" and "gaussian".

size

the number of trials for the binomial dependent variables. A matrix with one row per statistical unit and one column per binomial dependent variable is expected.
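As a sketch of the expected shape only, assuming a hypothetical data set with 5 statistical units and two binomial responses (the names y_a and y_b and all values are made up for illustration), such a matrix could be assembled as follows:

```r
# Hypothetical example: 5 statistical units, 2 binomial responses.
n_units <- 5
trials_a <- c(10, 12, 8, 10, 15)   # trials behind response y_a (made-up values)
trials_b <- rep(20, n_units)       # response y_b always has 20 trials

# One row per statistical unit, one column per binomial response.
size <- cbind(y_a = trials_a, y_b = trials_b)
dim(size)
```

The resulting object would then be supplied through the size argument.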

weights

weights on individuals (not available for now)

offset

used for the Poisson dependent variables. A vector, or a matrix with one row per observation and one column per Poisson dependent variable, is expected.

na.action

a function which indicates what should happen when the data contain NAs. The default is set to na.omit.

crit

a list of two elements, maxit and tol, giving respectively the maximum number of iterations and the convergence tolerance of the Fisher scoring algorithm. Defaults are 50 and 10e-6 respectively.

method

structural relevance criterion. Object of class "method.SCGLR" built by methodSR for Structural Relevance.

folds

number of folds - default is 10. Although folds can be as large as the sample size (leave-one-out CV), it is not recommended for large datasets. Smallest value allowable is folds=2. folds can also be provided as a vector (same length as data) of fold identifiers.
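When fold identifiers are supplied as a vector, one possible way to build a balanced, shuffled assignment with base R is sketched below. The row count n and fold count k are assumed values chosen for illustration.

```r
set.seed(42)   # make the fold assignment reproducible
n <- 100       # assumed number of rows in the data frame
k <- 5         # assumed number of folds

# Balanced fold labels 1..k, shuffled so folds are not tied to row order.
fold_id <- sample(rep(seq_len(k), length.out = n))
table(fold_id)
```

Such a vector could then be passed as folds = fold_id instead of a single integer.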

type

loss function to use for cross-validation. Six options are currently available; which ones apply depends on the distribution families of the responses. If the responses are all Bernoulli distributed, prediction performance may be measured through the area under the ROC curve: type = "auc". In any other case, one can choose among the following five options: "likelihood", "aic", "aicc", "bic", "mspe".
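The rule above can be sketched as a small helper. choose_type is a hypothetical function written for illustration, not part of SCGLR, and falling back to "mspe" simply mirrors the default shown in Usage.

```r
# Hypothetical helper (not part of SCGLR): pick a loss from the family vector.
choose_type <- function(family) {
  # "auc" is only meaningful when every response is Bernoulli distributed.
  if (all(family == "bernoulli")) "auc" else "mspe"
}

choose_type(c("bernoulli", "bernoulli"))
choose_type(c("poisson", "bernoulli"))
```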

st

logical (default FALSE) controlling the order in which themes are built and fitted. TRUE means random order, FALSE means sequential order (T1, ..., Tr).

Value

a list containing the path followed along the selection process, the associated mean squared prediction error and the best configuration.
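To give an intuition for what a backward path over component counts looks like, here is a toy sketch in base R. It is not the SCGLR implementation: toy_criterion is a made-up stand-in for a cross-validated loss (lower is better), and the stopping rule and tie-breaking are assumptions. The element names H_path, cv_path and H_best merely mirror those used in the Examples.

```r
# Toy illustration only: NOT the SCGLR implementation.
# Made-up stand-in for a cross-validated loss (lower is better).
toy_criterion <- function(H) sum((H - c(1, 2))^2) + 0.1 * sum(H)

backward_path <- function(H, criterion) {
  path <- list(H)
  scores <- criterion(H)
  # Remove one component at a time until no theme has any component left.
  while (sum(H) > 0) {
    # Candidate configurations: drop one component from each non-empty theme.
    candidates <- lapply(which(H > 0), function(j) { H[j] <- H[j] - 1; H })
    cand_scores <- vapply(candidates, criterion, numeric(1))
    # Keep the candidate with the lowest criterion (first one wins ties).
    H <- candidates[[which.min(cand_scores)]]
    path[[length(path) + 1]] <- H
    scores <- c(scores, min(cand_scores))
  }
  list(H_path = do.call(rbind, path),
       cv_path = scores,
       H_best = path[[which.min(scores)]])
}

res <- backward_path(c(3, 3), toy_criterion)
res$H_path
res$H_best
```

Each row of H_path is one visited configuration, cv_path holds the matching criterion values, and H_best is the configuration with the lowest value along the path.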

Details

Models for themes are specified symbolically.

A model has the form response ~ terms, where response is the numeric response vector and terms is a series of R themes composed of predictors.

Themes are separated by "|" (pipe), as in:
y1 + y2 + ... ~ x11 + x12 + ... + x1_ | x21 + x22 + ... | ... + x1_ + ... | a1 + a2 + ...

See multivariateFormula.
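As an illustration of the pipe-separated layout, the sketch below assembles such a formula from plain character vectors using base R only. All variable names are hypothetical, and the trailing block simply mirrors the a1 + a2 + ... part of the template above. In practice multivariateFormula is the documented way to build the formula object.

```r
# Hypothetical variable names, for illustration only.
responses  <- c("y1", "y2")
themes     <- list(c("x11", "x12"), c("x21", "x22", "x23"))
additional <- c("a1")

# Join predictors inside a block with "+", and blocks with "|".
blocks <- vapply(c(themes, list(additional)), paste, character(1),
                 collapse = " + ")
form_text <- paste(paste(responses, collapse = " + "), "~",
                   paste(blocks, collapse = " | "))
form_text

# The text parses as an ordinary R formula object.
form <- as.formula(form_text)
```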

Examples

if (FALSE) { # \dontrun{
library(SCGLR)

# load sample data
data(genus)

# get variable names from dataset
n <- names(genus)
n <- n[!n %in% c("geology","surface","lon","lat","forest","altitude")]
ny <- n[grep("^gen",n)]    # Y <- names that begin with "gen"
nx1 <- n[grep("^evi",n)]   # X1 <- names that begin with "evi"
nx2 <- n[-c(grep("^evi",n),grep("^gen",n))]   # X2 <- remaining names

# build the formula and the family vector, then run the backward selection
form <- multivariateFormula(ny, nx1, nx2, A = c("geology"))
fam <- rep("poisson", length(ny))
testcv <- scglrThemeBackward(form, data = genus, H = c(2, 2), family = fam,
                             offset = genus$surface, folds = 3)

# Cross-validation pathway
testcv$H_path

# Plot criterion
plot(testcv$cv_path)

# Best combination
testcv$H_best
} # }