Sample Data

The sample dataset used in the vignettes is from the publication by Gaydosik AM et al. It can be downloaded from here. Once downloaded, unpack the archive with tar xvzf GSE128531.tar.gz; there is no need to decompress the individual .csv.gz files. To reduce computation time, the dataset includes only 6 samples with 300 cells each. The complete raw data are available at GEO.

The sample data are in .csv.gz format, but Scillus also accepts 10x Genomics cellranger output, which has the following layout:

$ tree filtered_feature_bc_matrix
filtered_feature_bc_matrix
├── barcodes.tsv.gz
├── features.tsv.gz
└── matrix.mtx.gz

Metadata

Because scRNA-seq projects usually include more than one sample, the loading and filtering functions of Scillus generate and process a list of Seurat objects, one per sample. The loading function requires a metadata dataframe with at least two columns: one named sample, and the other named file or folder, depending on the input data format. For the sample data, the metadata can be constructed as follows:

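library(tidyverse)

# List the per-sample .csv.gz files (list.files() returns them sorted by name)
a <- list.files("your/path/to/sample/data/GSE128531_RAW", full.names = TRUE)
# One row per sample; the sample name is the file name without its extension
m <- tibble(file = a,
            sample = stringr::str_remove(basename(a), "\\.csv\\.gz$"),
            # Files sort as three CTCL-* then three HC-*, matching this order
            group = rep(c("CTCL", "Normal"), each = 3))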

Table 1. Metadata of sample data
file                                                                     sample  group
/Users/mxu3/Documents/projects/Scillus/test/GSE128531_RAW/CTCL-5.csv.gz  CTCL-5  CTCL
/Users/mxu3/Documents/projects/Scillus/test/GSE128531_RAW/CTCL-6.csv.gz  CTCL-6  CTCL
/Users/mxu3/Documents/projects/Scillus/test/GSE128531_RAW/CTCL-8.csv.gz  CTCL-8  CTCL
/Users/mxu3/Documents/projects/Scillus/test/GSE128531_RAW/HC-1.csv.gz    HC-1    Normal
/Users/mxu3/Documents/projects/Scillus/test/GSE128531_RAW/HC-2.csv.gz    HC-2    Normal
/Users/mxu3/Documents/projects/Scillus/test/GSE128531_RAW/HC-3.csv.gz    HC-3    Normal

Any additional sample-level metadata (e.g. group, sex, age, treatment) can be appended to the dataframe as extra columns, and it will be carried into the Seurat objects during loading. Here, the group column is added for demonstration purposes.
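
As a minimal sketch of appending a sample-level annotation, the snippet below joins a second table onto the metadata by sample name. The batch column and its values are invented purely for illustration and are not part of the dataset; m_demo is a stand-in for the metadata built above.

```r
library(tibble)
library(dplyr)

# Stand-in for the metadata built above (file column omitted for brevity)
m_demo <- tibble(sample = c("CTCL-5", "CTCL-6", "CTCL-8", "HC-1", "HC-2", "HC-3"),
                 group  = rep(c("CTCL", "Normal"), each = 3))

# Hypothetical sample-level annotation; these batch labels are made up
extra <- tibble(sample = c("HC-1", "HC-2", "HC-3", "CTCL-5", "CTCL-6", "CTCL-8"),
                batch  = c("b1", "b2", "b1", "b2", "b1", "b2"))

# Join by sample name so the row order of the two tables does not matter
m_extended <- left_join(m_demo, extra, by = "sample")

# Guard against samples that received no annotation
stopifnot(!anyNA(m_extended$batch))
```

Joining by name rather than binding columns by position avoids silently mislabeling samples when the two tables are ordered differently.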

For 10x Genomics cellranger output, the metadata should have a column named folder instead of file, as in Table 2, and each folder should contain the 3 files shown in the tree structure above.

Table 2. Metadata for 10x Genomics cellranger output
folder                                                            sample  group
/Users/mxu3/Documents/projects/Scillus/test/GSE128531_RAW/CTCL-5  CTCL-5  CTCL
/Users/mxu3/Documents/projects/Scillus/test/GSE128531_RAW/CTCL-6  CTCL-6  CTCL
/Users/mxu3/Documents/projects/Scillus/test/GSE128531_RAW/CTCL-8  CTCL-8  CTCL
/Users/mxu3/Documents/projects/Scillus/test/GSE128531_RAW/HC-1    HC-1    Normal
/Users/mxu3/Documents/projects/Scillus/test/GSE128531_RAW/HC-2    HC-2    Normal
/Users/mxu3/Documents/projects/Scillus/test/GSE128531_RAW/HC-3    HC-3    Normal
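
A folder-based metadata table like Table 2 could be assembled along the following lines. This is only a sketch: build_folder_metadata is a hypothetical helper, not a Scillus function, and it assumes every immediate sub-directory of the parent directory holds one sample's cellranger output, with the folder name serving as the sample name.

```r
library(tibble)

# Files expected in each cellranger output folder (from the tree shown earlier)
expected_files <- c("barcodes.tsv.gz", "features.tsv.gz", "matrix.mtx.gz")

# Hypothetical helper, not part of Scillus: one metadata row per sub-folder
build_folder_metadata <- function(parent_dir) {
  folders <- list.dirs(parent_dir, full.names = TRUE, recursive = FALSE)
  # TRUE when a folder contains all three expected files
  complete <- vapply(folders,
                     function(f) all(file.exists(file.path(f, expected_files))),
                     logical(1))
  if (!all(complete)) {
    stop("Incomplete 10x folder(s): ",
         paste(basename(folders[!complete]), collapse = ", "))
  }
  tibble(folder = folders, sample = basename(folders))
}
```

Calling the helper on the directory that contains the per-sample folders returns the folder and sample columns; further columns such as group can then be added in the same way as for the .csv.gz input.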

Palette setup

The plotting functions of Scillus take a palette setup so that each variable is colored consistently across different types of plots.

# Map each metadata variable to the name of the palette used to color it
pal <- tibble(var = c("sample", "group", "seurat_clusters"),
              pal = c("Set2", "Set1", "Paired"))

Table 3. Palette setup
var              pal
sample           Set2
group            Set1
seurat_clusters  Paired
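
The palette names in Table 3 coincide with palette names in the RColorBrewer package. Assuming that is the naming scheme intended (this section does not state it), a quick check that each palette exists and offers enough distinct colors might look like the sketch below. The sample and group counts come from the sample data; the cluster count is a made-up placeholder.

```r
library(RColorBrewer)

# Palette assignment from Table 3, as a named character vector
pal_names <- c(sample = "Set2", group = "Set1", seurat_clusters = "Paired")

# Number of levels to color: 6 samples and 2 groups in the sample data;
# the cluster count is a hypothetical placeholder
n_levels <- c(sample = 6, group = 2, seurat_clusters = 10)

# Every name must be a known Brewer palette
stopifnot(all(pal_names %in% rownames(brewer.pal.info)))

# Maximum number of distinct colors each palette provides
max_colors <- brewer.pal.info[pal_names, "maxcolors"]
names(max_colors) <- names(pal_names)

# TRUE where the palette has enough colors without interpolation
enough <- n_levels <= max_colors
```

If a variable has more levels than its palette provides, either a larger palette or color interpolation would be needed; how Scillus itself handles that case is not covered here.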