descope.tokenizer

class TokenizerForATAC(cell_line_ft: str | AnnData, topk_ccres: int = 50000, pert_col: str = 'perturbation')

Bases: object

A tokenizer for ATAC-seq data that preprocesses and tokenizes datasets for pretraining.

Example:

>>> from descope.tokenizer import TokenizerForATAC
>>> tokenizer = TokenizerForATAC(
...     cell_line_ft="path/to/finetune_data.h5ad",
...     topk_ccres=50000,
...     pert_col="perturbation"
... )
>>> tokenizer.tokenize(
...     cell_line_pt=["path/to/pretrain_data1.h5ad", "path/to/pretrain_data2.h5ad"],
...     cell_line_name=["cell_line_pretrain1", "cell_line_pretrain2"],
...     pert_col=["perturbation1", "perturbation2"],
...     save_dir="./tokenized_dataset",
...     apply_pert_gene_filter=False,
...     chunk_size=20000
... )

tokenize(cell_line_pt: list[str | AnnData], cell_line_name: list[str], pert_col: list[str] | None = None, save_dir: str = './tokenized_datasets', apply_pert_gene_filter: bool = True, chunk_size: int = 20000)

Tokenizes multiple pretraining ATAC-seq datasets into Hugging Face Datasets format.

Parameters:

cell_line_pt : list[Union[str, sc.AnnData]]
    A list of paths to .h5ad files or AnnData objects representing pretraining cell lines.
cell_line_name : list[str]
    A list of names corresponding to each pretraining dataset; used as subdirectory names when saving tokenized data.
pert_col : list[str] or None, optional (default: None)
    A list of column names in the .obs attribute of each AnnData object indicating the perturbation labels.
    If None, the pert_col specified during tokenizer initialization is used for all datasets.
save_dir : str, optional (default: "./tokenized_datasets")
    Directory path where the tokenized datasets will be saved.
apply_pert_gene_filter : bool, optional (default: True)
    Whether to filter out cells with perturbations not present in the finetune dataset.
    If True, only cells with perturbations in self.pert_genes_list are retained.
chunk_size : int, optional (default: 20000)
    Number of cells to process per chunk during dataset construction.
    Helps reduce memory overhead when handling large AnnData objects.
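
The chunk_size behaviour described above can be pictured as splitting the cells into consecutive row ranges, so that only one range is held in memory at a time. The following is a minimal sketch, not the library's implementation; the helper name iter_chunk_bounds is hypothetical and is not part of descope.

```python
from typing import Iterator


def iter_chunk_bounds(n_cells: int, chunk_size: int = 20000) -> Iterator[tuple[int, int]]:
    """Yield (start, stop) row ranges that cover n_cells in steps of chunk_size.

    Hypothetical helper for illustration only; not part of descope.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, n_cells, chunk_size):
        # The last chunk is shorter when n_cells is not a multiple of chunk_size.
        yield start, min(start + chunk_size, n_cells)


# 45,000 cells with the default chunk size: two full chunks and one partial chunk.
bounds = list(iter_chunk_bounds(45000, chunk_size=20000))
print(bounds)
```

A smaller chunk_size lowers peak memory at the cost of more iterations; the total number of cells processed is unchanged.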

class TokenizerForRNA(cell_line_ft: str | AnnData, target_sum: float = 10000.0, pert_col: str = 'gene', gene_names_col: str | None = None, normalize_before_align_features: bool = False)

Bases: object

A tokenizer for single-cell RNA-seq (scRNA-seq) perturbation data that preprocesses and tokenizes datasets for pretraining.

Example:

>>> from descope.tokenizer import TokenizerForRNA
>>> from descope.utils import DuplicatedFeatureHandling
>>> tokenizer = TokenizerForRNA(
...     cell_line_ft="path/to/finetune_data.h5ad",
...     target_sum=1e4,
...     pert_col="gene",
...     gene_names_col="gene_symbols",
...     normalize_before_align_features=False,
... )
>>> tokenizer.tokenize(
...     cell_line_pt=["path/to/pretrain_data1.h5ad", "path/to/pretrain_data2.h5ad"],
...     cell_line_name=["cell_line_pretrain1", "cell_line_pretrain2"],
...     pert_col=["gene", "gene"],
...     gene_names_col=["gene_name", "gene_name"],
...     save_dir="./tokenized_dataset",
...     apply_pert_gene_filter=False,
...     duplicated_features_handling=DuplicatedFeatureHandling.mean_pooling,
...     skip_raw_counts_check=True,
...     chunk_size=20000
... )

tokenize(cell_line_pt: list[str | AnnData], cell_line_name: list[str], pert_col: list[str] | None = None, gene_names_col: list[str] | None = None, save_dir: str = './tokenized_datasets', apply_pert_gene_filter: bool = True, duplicated_features_handling: DuplicatedFeatureHandling = DuplicatedFeatureHandling.max_pooling, skip_raw_counts_check: bool = False, chunk_size: int = 20000)

Tokenizes multiple pretraining scRNA-seq datasets into Hugging Face Datasets format.

Parameters:

cell_line_pt : list[Union[str, sc.AnnData]]
    A list of paths to .h5ad files or AnnData objects representing pretraining cell lines.
cell_line_name : list[str]
    A list of names corresponding to each pretraining dataset; used as subdirectory names when saving tokenized data.
pert_col : list[str] or None, optional (default: None)
    A list of column names in the .obs attribute of each AnnData object indicating the perturbation labels.
    If None, the pert_col specified during tokenizer initialization is used for all datasets.
gene_names_col : list[str] or None, optional (default: None)
    A list of column names in the .var attribute of each AnnData object that contain gene symbols.
    Used to standardize gene names before alignment. If None, assumes var_names are already gene symbols.
save_dir : str, optional (default: "./tokenized_datasets")
    Directory path where the tokenized datasets will be saved.
apply_pert_gene_filter : bool, optional (default: True)
    Whether to filter out cells with perturbations not present in the finetune dataset.
    If True, only cells with perturbations in self.pert_genes_list are retained.
duplicated_features_handling : DuplicatedFeatureHandling, optional (default: max_pooling)
    Strategy for handling duplicated gene names.
skip_raw_counts_check : bool, optional (default: False)
    Whether to skip the raw counts check during preprocessing.
    If you are sure that your data is raw counts, you can set this to True to skip the check.
chunk_size : int, optional (default: 20000)
    Number of cells to process per chunk during dataset construction.
    Helps reduce memory overhead when handling large AnnData objects.
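
The duplicated_features_handling parameter names two strategies in this reference, max_pooling (the default) and mean_pooling, without describing them further. The sketch below shows one plausible reading: columns that share a gene name are collapsed into a single column by taking the per-cell maximum or mean. This interpretation is an assumption, and pool_duplicated_genes is a hypothetical helper working on plain Python lists rather than an AnnData object.

```python
from collections import defaultdict
from statistics import mean


def pool_duplicated_genes(gene_names, expression, how="max"):
    """Collapse columns that share a gene name.

    Hypothetical illustration; the actual behaviour of
    DuplicatedFeatureHandling is assumed here, not documented.
    Returns (unique_names, pooled_rows).
    """
    if how not in ("max", "mean"):
        raise ValueError("how must be 'max' or 'mean'")
    # Map each gene name to the column indices where it occurs (first-seen order).
    columns = defaultdict(list)
    for idx, name in enumerate(gene_names):
        columns[name].append(idx)
    pool = max if how == "max" else mean
    pooled = [
        [pool([row[i] for i in idxs]) for idxs in columns.values()]
        for row in expression
    ]
    return list(columns), pooled


names = ["TP53", "MYC", "TP53"]          # "TP53" appears twice
cells = [[1.0, 4.0, 3.0], [0.0, 2.0, 6.0]]
print(pool_duplicated_genes(names, cells, how="max"))
print(pool_duplicated_genes(names, cells, how="mean"))
```

Under this reading, max pooling keeps the strongest signal among duplicates, while mean pooling averages them.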

tokenize_adata_to_hf_dataset(adata: AnnData, cell_line_name: str, pert_col: str = 'gene', chunk_size: int = 20000) -> Dataset

Convert an AnnData object into a Hugging Face Dataset for downstream modeling.

This function processes single-cell gene expression data stored in an AnnData object by extracting expression vectors and associated perturbation labels, then packages them into a Hugging Face Dataset. To manage memory usage for large datasets, the conversion is performed in chunks.

Parameters

adata : sc.AnnData
    Single-cell dataset in AnnData format containing gene expression matrix (in .X)
    and perturbation annotations (in .obs[pert_col]).
cell_line_name : str
    Name of the cell line or cell type associated with all cells in the AnnData object.
    This will be added as a constant metadata field ("celltype") in the output dataset.
pert_col : str, default="gene"
    Column name in adata.obs that contains the perturbation labels (e.g., gene names).
chunk_size : int, default=20000
    Number of cells to process per chunk during dataset construction.
    Helps reduce memory overhead when handling large AnnData objects.

Returns

ds : datasets.Dataset
    A Hugging Face Dataset with the following columns:
    - "labels": Gene expression vector for each cell (as a list of floats).
    - "pert_gene": Perturbation label (e.g., gene name or "control").
    - "celltype": Constant string indicating the cell line name provided as input.

Notes

  • If adata.X is sparse (e.g., scipy sparse matrix), it is converted to a dense array using .toarray() before processing.

  • The resulting dataset is suitable for use with Hugging Face Transformers or other deep learning pipelines that expect dictionary-like batched inputs.
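
To make the documented output layout concrete, the sketch below assembles the three columns ("labels", "pert_gene", "celltype") chunk by chunk from plain Python lists. It is an illustration under stated assumptions, not the function's source: build_columns is a hypothetical helper, the nested lists stand in for adata.X, and the cell line name "K562" is only an example value.

```python
def build_columns(expression_rows, pert_labels, cell_line_name, chunk_size=20000):
    """Assemble the three documented columns one chunk of cells at a time.

    Hypothetical stand-in for illustration; not part of descope.
    """
    if len(expression_rows) != len(pert_labels):
        raise ValueError("one perturbation label is required per cell")
    columns = {"labels": [], "pert_gene": [], "celltype": []}
    n_cells = len(expression_rows)
    for start in range(0, n_cells, chunk_size):
        stop = min(start + chunk_size, n_cells)
        # Only this slice is converted to lists of floats at once.
        for row, pert in zip(expression_rows[start:stop], pert_labels[start:stop]):
            columns["labels"].append([float(v) for v in row])
            columns["pert_gene"].append(pert)
            columns["celltype"].append(cell_line_name)  # constant for every cell
    return columns


cols = build_columns([[0, 2, 1], [3, 0, 0]], ["TP53", "control"], "K562", chunk_size=1)
print(cols)
```

A dictionary of equal-length lists like this is the shape accepted by datasets.Dataset.from_dict in the Hugging Face datasets library.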

tokenize_adata_to_hf_dataset_for_atac(adata: str | AnnData, cell_line_name: str, perts_to_include: list[str] | None = None, perts_to_exclude: list[str] | None = None, topk_ccres: int = 50000, pert_col: str = 'perturbation', ctrl_name: str = 'control', save_dir: str = './tokenized_dataset', chunk_size: int = 20000) -> Dataset

Preprocess and tokenize ATAC-seq perturbation data into a Hugging Face Dataset.

Parameters

adata : str or sc.AnnData
    Input AnnData (or path to .h5ad) with ATAC profiles and perturbation labels.
cell_line_name : str
    Cell line name; saved as "celltype" in the dataset.
perts_to_include / perts_to_exclude : list of str, optional
    Mutually exclusive filters for perturbations. Only one may be specified.
    If both are None, all perturbations are retained.
    The control condition (ctrl_name) is always preserved regardless of the filter.
topk_ccres : int, default=50000
    Number of top variable cCREs to retain.
pert_col : str, default="perturbation"
    Column in adata.obs storing perturbation names.
ctrl_name : str, default="control"
    Label for control cells.
save_dir : str, default="./tokenized_dataset"
    Output directory for the saved dataset.
chunk_size : int, default=20000
    Number of cells to process per chunk during dataset construction.
    Helps reduce memory overhead when handling large AnnData objects.

Returns

datasets.Dataset
    Tokenized dataset saved to save_dir, with columns: "labels", "pert_gene", "celltype".
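
The perts_to_include / perts_to_exclude rules above (mutually exclusive, no filter when both are None, control always kept) can be sketched as a small selection function. This is an illustration of the documented semantics only: select_cells is a hypothetical helper, and raising ValueError when both filters are given is an assumption about how the conflict is reported.

```python
def select_cells(pert_labels, perts_to_include=None, perts_to_exclude=None, ctrl_name="control"):
    """Return the indices of cells kept by the include/exclude filter.

    Hypothetical illustration; not part of descope.
    """
    if perts_to_include is not None and perts_to_exclude is not None:
        # Assumption: the conflict is reported as a ValueError.
        raise ValueError("specify only one of perts_to_include or perts_to_exclude")
    include = None if perts_to_include is None else set(perts_to_include)
    exclude = set(perts_to_exclude or ())
    kept = []
    for i, pert in enumerate(pert_labels):
        if pert == ctrl_name:
            kept.append(i)  # control cells are preserved regardless of the filter
        elif include is not None:
            if pert in include:
                kept.append(i)
        elif pert not in exclude:
            kept.append(i)
    return kept


labels = ["control", "GATA1", "MYC", "control", "TP53"]
print(select_cells(labels, perts_to_include=["MYC"]))
print(select_cells(labels, perts_to_exclude=["MYC", "control"]))
```

Note that listing the control label in perts_to_exclude has no effect in this sketch, matching the statement that the control condition is always preserved.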

tokenize_adata_to_hf_dataset_for_rna(adata: str | AnnData, cell_line_name: str, perts_to_include: list[str] | None = None, perts_to_exclude: list[str] | None = None, target_sum: float = 10000.0, pert_col: str = 'gene', ctrl_name: str = 'non-targeting', skip_raw_counts_check: bool = False, save_dir: str = './tokenized_dataset', chunk_size: int = 20000) -> Dataset

Preprocess and tokenize scRNA-seq perturbation data into a Hugging Face Dataset.

Parameters

adata : str or sc.AnnData
    Input AnnData (or path to .h5ad) with raw gene counts and perturbation labels.
cell_line_name : str
    Cell line name; stored as "celltype" in the output dataset.
perts_to_include / perts_to_exclude : list of str, optional
    Mutually exclusive filters for perturbations. Only one may be specified.
    If both are None, all perturbations are retained.
    The control condition (ctrl_name) is always preserved regardless of the filter.
target_sum : float, default=1e4
    Total count per cell after normalization (CPM-like scaling).
pert_col : str, default="gene"
    Column in adata.obs containing perturbation identifiers.
ctrl_name : str, default="non-targeting"
    Label for control cells.
skip_raw_counts_check : bool, default=False
    Skip assertion that input counts are integers (use only if data is pre-validated).
save_dir : str, default="./tokenized_dataset"
    Directory to save the resulting dataset.
chunk_size : int, default=20000
    Number of cells to process per chunk during dataset construction.
    Helps reduce memory overhead when handling large AnnData objects.

Returns

datasets.Dataset
    Tokenized dataset saved to save_dir, with columns: "labels", "pert_gene", "celltype".
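
Two of the parameters above describe simple numeric steps: the raw counts check (input values must be integers) and scaling each cell to target_sum. The sketch below illustrates both on plain Python lists. The helper names are hypothetical, the handling of all-zero cells is an assumption, and any further transformation the library may apply after scaling is not shown.

```python
def looks_like_raw_counts(rows):
    """True if every value is a non-negative whole number (hypothetical check)."""
    return all(v >= 0 and float(v).is_integer() for row in rows for v in row)


def normalize_total(rows, target_sum=1e4):
    """Scale each cell so its counts sum to target_sum.

    Assumption: cells whose counts are all zero are left unchanged.
    """
    normalized = []
    for row in rows:
        total = sum(row)
        scale = target_sum / total if total > 0 else 1.0
        normalized.append([v * scale for v in row])
    return normalized


counts = [[1, 3, 0], [0, 0, 5]]
if not looks_like_raw_counts(counts):
    raise ValueError("expected raw integer counts")
scaled = normalize_total(counts, target_sum=1e4)
print(scaled)
```

After scaling, every non-empty cell sums to target_sum, which makes cells with different sequencing depths comparable.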