The sample dataset used in the vignettes comes from a publication by Gaydosik AM et al. It can be downloaded from here. Once downloaded, the archive can be extracted with `tar xvzf GSE128531.tar.gz`. There is no need to decompress the individual `.csv.gz` files. To reduce computation time, the dataset includes only 6 samples, with 300 cells per sample. The complete raw data are available at GEO.
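For readers who prefer to stay inside R, the extraction step can be sketched with `utils::untar()`. The helper name `extract_archive` and the placeholder path are assumptions for illustration and are not part of Scillus.

```r
# Hypothetical helper (not part of Scillus): extract a .tar.gz archive
# and return the paths of the extracted members.
extract_archive <- function(archive, out_dir = dirname(archive)) {
  if (!file.exists(archive)) {
    stop("Archive not found: ", archive)
  }
  # List the archive members first, then extract them into out_dir
  members <- utils::untar(archive, list = TRUE)
  utils::untar(archive, exdir = out_dir)
  file.path(out_dir, members)
}

# Placeholder path; adjust to wherever the archive was saved.
# extract_archive("your/path/to/GSE128531.tar.gz")
```

The individual `.csv.gz` files are left compressed, in line with the note above.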
The sample data are in `.csv.gz` format, but Scillus also works with the 10x Genomics cellranger output format, shown below:
$ tree filtered_feature_bc_matrix
filtered_feature_bc_matrix
├── barcodes.tsv.gz
├── features.tsv.gz
└── matrix.mtx.gz
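Before loading cellranger output, it can help to confirm that a folder really has the layout shown above. The following is a minimal sketch; `missing_10x_files` is a hypothetical helper, not a Scillus function, and the path in the usage line is a placeholder.

```r
# Hypothetical helper (not part of Scillus): report which of the three
# expected cellranger files are missing from a folder.
missing_10x_files <- function(folder) {
  expected <- c("barcodes.tsv.gz", "features.tsv.gz", "matrix.mtx.gz")
  expected[!file.exists(file.path(folder, expected))]
}

# An empty result means all three files are present.
# missing_10x_files("your/path/to/filtered_feature_bc_matrix")
```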
Because scRNA-seq projects usually include more than one sample, the loading and filtering functions of Scillus generate and process a list of Seurat objects. A metadata dataframe should be provided to the loading function. The dataframe should have at least two columns: one named `sample`, and the other named `file` or `folder`, depending on the input data format. For the sample data, the metadata can be constructed as follows:
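library(tidyverse)
# List the per-sample .csv.gz files with their full paths
a <- list.files("your/path/to/sample/data/GSE128531_RAW", full.names = TRUE)
# One row per sample: file path, sample name (file name minus extension), group
m <- tibble(file = a,
            sample = stringr::str_remove(basename(a), "\\.csv\\.gz$"),
            group = rep(c("CTCL", "Normal"), each = 3))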
file | sample | group |
---|---|---|
/Users/mxu3/Documents/projects/Scillus/test/GSE128531_RAW/CTCL-5.csv.gz | CTCL-5 | CTCL |
/Users/mxu3/Documents/projects/Scillus/test/GSE128531_RAW/CTCL-6.csv.gz | CTCL-6 | CTCL |
/Users/mxu3/Documents/projects/Scillus/test/GSE128531_RAW/CTCL-8.csv.gz | CTCL-8 | CTCL |
/Users/mxu3/Documents/projects/Scillus/test/GSE128531_RAW/HC-1.csv.gz | HC-1 | Normal |
/Users/mxu3/Documents/projects/Scillus/test/GSE128531_RAW/HC-2.csv.gz | HC-2 | Normal |
/Users/mxu3/Documents/projects/Scillus/test/GSE128531_RAW/HC-3.csv.gz | HC-3 | Normal |
Any additional sample-level metadata (e.g. group, sex, age, treatment) can be appended to the dataframe and will be included in the Seurat objects during loading. Here, the `group` column is added for demonstration purposes.
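The column requirements described above can be checked before the metadata is handed to the loading function. This is a sketch in base R. `check_metadata` is a hypothetical helper rather than part of Scillus, and requiring exactly one of `file` or `folder` is an assumption about how the two input formats are distinguished.

```r
# Hypothetical helper (not part of Scillus): check a metadata dataframe
# against the column requirements described in the text.
check_metadata <- function(m) {
  if (!"sample" %in% names(m)) {
    stop("metadata needs a column named 'sample'")
  }
  # Assumption: exactly one of 'file' or 'folder' should be present
  path_col <- intersect(c("file", "folder"), names(m))
  if (length(path_col) != 1) {
    stop("metadata needs exactly one of the columns 'file' or 'folder'")
  }
  # file.exists() returns TRUE for existing directories as well as files
  paths <- m[[path_col]]
  missing <- paths[!file.exists(paths)]
  if (length(missing) > 0) {
    stop("paths not found: ", paste(missing, collapse = ", "))
  }
  invisible(m)
}

# check_metadata(m)
```

Extra columns such as `group` pass through untouched, since only the required columns are inspected.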
For 10x Genomics cellranger output, the metadata should have a column named `folder` instead of `file`, as in Table 2, and each folder should contain the 3 files shown in the tree structure above.
folder | sample | group |
---|---|---|
/Users/mxu3/Documents/projects/Scillus/test/GSE128531_RAW/CTCL-5 | CTCL-5 | CTCL |
/Users/mxu3/Documents/projects/Scillus/test/GSE128531_RAW/CTCL-6 | CTCL-6 | CTCL |
/Users/mxu3/Documents/projects/Scillus/test/GSE128531_RAW/CTCL-8 | CTCL-8 | CTCL |
/Users/mxu3/Documents/projects/Scillus/test/GSE128531_RAW/HC-1 | HC-1 | Normal |
/Users/mxu3/Documents/projects/Scillus/test/GSE128531_RAW/HC-2 | HC-2 | Normal |
/Users/mxu3/Documents/projects/Scillus/test/GSE128531_RAW/HC-3 | HC-3 | Normal |
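A folder-based table like the one above could be built along the same lines as the file-based one. The sketch below assumes one cellranger output subfolder per sample directly under a root directory. `build_10x_metadata` is a hypothetical helper, and inferring `group` from the `CTCL` prefix is specific to the sample names in this dataset.

```r
library(tibble)

# Hypothetical helper (not part of Scillus): build folder-based metadata
# from a directory that holds one subfolder per sample.
build_10x_metadata <- function(root) {
  # Immediate subfolders only, returned with their full paths
  d <- list.dirs(root, full.names = TRUE, recursive = FALSE)
  s <- basename(d)
  tibble(folder = d,
         sample = s,
         # Group inferred from the sample-name prefix used in this dataset
         group = ifelse(startsWith(s, "CTCL"), "CTCL", "Normal"))
}

# Placeholder path; adjust to your own directory layout.
# m_10x <- build_10x_metadata("your/path/to/sample/data/GSE128531_RAW")
```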
The plotting functions of Scillus incorporate a palette setup to ensure consistent color schemes for each variable across different types of plots.
pal <- tibble(var = c("sample", "group", "seurat_clusters"),
              pal = c("Set2", "Set1", "Paired"))
var | pal |
---|---|
sample | Set2 |
group | Set1 |
seurat_clusters | Paired |
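To illustrate how such a table ties one palette to each variable, the sketch below looks up the palette name for a variable and, when the RColorBrewer package is installed, expands it into colors. `palette_for` is a hypothetical helper, the table is rebuilt as a base data.frame so the example runs without the tidyverse, and treating the names as RColorBrewer palettes is an assumption rather than something stated above.

```r
# Base-R copy of the palette table from the text
pal <- data.frame(var = c("sample", "group", "seurat_clusters"),
                  pal = c("Set2", "Set1", "Paired"),
                  stringsAsFactors = FALSE)

# Hypothetical helper (not part of Scillus): look up the palette name
# registered for a variable.
palette_for <- function(pal, variable) {
  hit <- pal$pal[pal$var == variable]
  if (length(hit) != 1) {
    stop("expected exactly one palette for variable: ", variable)
  }
  hit
}

palette_for(pal, "group")

# Assumption: the names refer to RColorBrewer palettes. If that package
# is installed, a name can be expanded into a vector of hex colors.
if (requireNamespace("RColorBrewer", quietly = TRUE)) {
  RColorBrewer::brewer.pal(3, palette_for(pal, "group"))
}
```

Keeping the mapping in a single table means every plot that colors by `group` draws from the same palette.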