Main model training function for finding the best model that characterises a subpopulation

Trains on half of all cells to find optimal ElasticNet and LDA models for predicting a subpopulation

training(
  genes = NULL,
  cluster_mixedpop1 = NULL,
  mixedpop1 = NULL,
  mixedpop2 = NULL,
  c_selectID = NULL,
  listData = list(),
  out_idx = 1,
  standardize = TRUE,
  trainset_ratio = 0.5,
  LDA_run = FALSE,
  log_transform = FALSE
)

Arguments

genes	a vector of gene names (for ElasticNet shrinkage); gene symbols must be in the same format as the gene names in subpop2. Genes should be listed in order of importance, e.g. the most significant differentially expressed genes first, because if the gene list contains too many genes, only the top 500 genes are used.
cluster_mixedpop1	a vector of cluster assignments for the cells in mixedpop1
mixedpop1	a SingleCellExperiment object from the training mixed population
mixedpop2	a SingleCellExperiment object from the target mixed population
c_selectID	a number specifying which subpopulation is used for training
listData	a list in which to store the output
out_idx	a number specifying the index at which results are written into the output list. This is needed when running bootstraps.
standardize	a logical value specifying whether or not to standardize the training matrix
trainset_ratio	a number specifying the proportion of cells to be part of the training subpopulation
LDA_run	a logical value specifying whether an LDA run is added for comparison with ElasticNet
log_transform	a logical value specifying whether a log transform should be applied
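
The subsampling implied by trainset_ratio and the 500-gene cap on genes can be illustrated with a minimal base-R sketch. This is not the package's implementation: the toy cluster vector, the gene names, and all variable names other than c_selectID and trainset_ratio are assumptions made for illustration, and drawing an equally sized set from the remaining cells is inferred from the example log further down rather than stated in the argument descriptions.

```r
# Hypothetical toy inputs (not from scGPS): 10 cells in clusters 1-3
set.seed(1)
clusters <- c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3)
c_selectID <- 1
trainset_ratio <- 0.5

# Cells in the selected (source) subpopulation vs. all remaining cells
source_idx <- which(clusters == c_selectID)
remain_idx <- which(clusters != c_selectID)

# Number of source cells to use for training
n_train <- round(length(source_idx) * trainset_ratio)

# Subsample the source cells, and draw an equally sized set from the rest
train_source <- sample(source_idx, n_train)
train_remain <- sample(remain_idx, min(n_train, length(remain_idx)))

# Cap an importance-ordered gene list at its top 500 entries
genes <- paste0("gene", 1:600)  # hypothetical gene names
genes_used <- head(genes, 500)
```

Because the cap keeps the first entries only, the order of the gene vector decides which genes reach the model.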

Value

a list with prediction results written into the element indexed by out_idx
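
The role of out_idx in bootstrapping can be sketched as follows. The function mock_training below is a hypothetical stand-in, not part of scGPS, and the flat structure it writes is a simplification of the real output. It only shows the pattern of passing the accumulated list back in and writing each run at its own index.

```r
# Hypothetical stand-in for a training call: it records one result per run
# in the slot given by out_idx and returns the updated list.
mock_training <- function(listData = list(), out_idx = 1) {
  # Create the result slot on the first run
  if (is.null(listData$Accuracy)) {
    listData$Accuracy <- list()
  }
  # Placeholder value; a real run would compute this from data
  listData$Accuracy[[out_idx]] <- out_idx / 10
  listData
}

n_boot <- 3
results <- list()
for (i in seq_len(n_boot)) {
  # Pass the accumulated list back in so earlier runs are preserved
  results <- mock_training(listData = results, out_idx = i)
}

# One entry per bootstrap run
length(results$Accuracy)
```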

Author

Quan Nguyen, 2017-11-25

Examples


# subpopulation to train on, and the slot of listData to write results into
c_selectID <- 1
out_idx <- 1
# training (source) mixed population
day2 <- day_2_cardio_cell_sample
mixedpop1 <- new_scGPS_object(ExpressionMatrix = day2$dat2_counts,
    GeneMetadata = day2$dat2geneInfo, CellMetadata = day2$dat2_clusters)
# target mixed population
day5 <- day_5_cardio_cell_sample
mixedpop2 <- new_scGPS_object(ExpressionMatrix = day5$dat5_counts,
    GeneMetadata = day5$dat5geneInfo, CellMetadata = day5$dat5_clusters)
# genes used as candidate predictors
genes <- training_gene_sample
genes <- genes$Merged_unique
listData <- training(genes,
    cluster_mixedpop1 = colData(mixedpop1)[, 1],
    mixedpop1 = mixedpop1, mixedpop2 = mixedpop2, c_selectID = c_selectID,
    listData = list(), out_idx = out_idx, trainset_ratio = 0.5)
#> Total 224 cells as source subpop
#> Total 366 cells in remaining subpops
#> subsampling 112 cells for training source subpop
#> subsampling 112 cells in remaining subpops for training
#> use 6 genes for training model
#> use 6 genes 224 cells for testing model
#> rename remaining subpops to 2_3
#> there are 112 cells in class 2_3 and 112 cells in class 1
#> removing 1 genes with no variance
#> standardizing prediction/target dataset
#> performning elasticnet model training...
#> extracting deviance and best gene features...
#> lambda min is at location 17
#> the leave-out cells in the source subpop is 112
#> use 112 target subpops cells for leave-out test set
#> standardizing the leave-out target and source subpops...
#> start ElasticNet prediction for estimating accuracy...
#> evaluation accuracy ElasticNet 0.660377358490566
names(listData)
#> [1] "Accuracy"        "ElasticNetGenes" "Deviance"        "ElasticNetFit"  
#> [5] "LDAFit"          "predictor_S1"   
listData$Accuracy
#> [[1]]
#> [[1]][[1]]
#> [[1]][[1]][[1]]
#> [1] 140
#> 
#> [[1]][[1]][[2]]
#> [1] 72
#> 
#> 
#>
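
The two numbers printed for listData$Accuracy are numerically consistent with the accuracy reported in the log: 140 / (140 + 72) equals the logged 0.660377358490566. Reading them as counts of correct and incorrect predictions is an assumption inferred from that arithmetic and is not stated on this page. A minimal sketch with the values hard-coded, so that it runs without scGPS:

```r
# Values copied from the printed listData$Accuracy above.
# Treating them as correct / incorrect prediction counts is an assumption.
acc <- list(list(list(140, 72)))

n_first <- acc[[1]][[1]][[1]]
n_second <- acc[[1]][[1]][[2]]

# Share of the first count in the total
accuracy <- n_first / (n_first + n_second)
print(accuracy)
```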