descope.tokenizer

class TokenizerForATAC(cell_line_ft: str | AnnData, topk_ccres: int = 50000, pert_col: str = 'perturbation')

Bases: object

A tokenizer class specifically designed for ATAC-seq data to preprocess and tokenize datasets for pretraining.

Example:

>>> from descope.tokenizer import TokenizerForATAC

>>> tokenizer = TokenizerForATAC(
...     cell_line_ft="path/to/finetune_data.h5ad",
...     topk_ccres=50000,
...     pert_col="perturbation"
... )

>>> tokenizer.tokenize(
...     cell_line_pt=["path/to/pretrain_data1.h5ad", "path/to/pretrain_data2.h5ad"],
...     cell_line_name=["cell_line_pretrain1", "cell_line_pretrain2"],
...     pert_col=["perturbation1", "perturbation2"],
...     save_dir="./tokenized_dataset",
...     apply_pert_gene_filter=False,
...     chunk_size=20000
... )

tokenize(cell_line_pt: list[str | AnnData], cell_line_name: list[str], pert_col: list[str] | None = None, save_dir: str = './tokenized_datasets', apply_pert_gene_filter: bool = True, chunk_size: int = 20000)

Tokenizes multiple pretraining ATAC-seq datasets into Hugging Face Datasets format.

Parameters:

cell_line_pt (list[Union[str, sc.AnnData]]): A list of paths to .h5ad files or AnnData objects representing pretraining cell lines.
cell_line_name (list[str]): A list of names corresponding to each pretraining dataset; used as subdirectory names when saving tokenized data.
pert_col (list[str] or None, optional; default: None): A list of column names in the .obs attribute of each AnnData object indicating the perturbation labels. If None, the pert_col specified during tokenizer initialization is used for all datasets.
save_dir (str, optional; default: "./tokenized_datasets"): Directory path where the tokenized datasets will be saved.
apply_pert_gene_filter (bool, optional; default: True): Whether to filter out cells with perturbations not present in the finetune dataset. If True, only cells with perturbations in self.pert_genes_list are retained.
chunk_size (int, optional; default: 20000): Number of cells to process per chunk during dataset construction. Helps reduce memory overhead when handling large AnnData objects.
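
The chunk_size behaviour described above can be pictured with a small standalone sketch. The helper name iter_chunks is hypothetical and is not part of descope; it only illustrates how a dataset of n_cells rows might be split into consecutive row ranges so that one chunk is materialized at a time.

```python
from typing import Iterator


def iter_chunks(n_cells: int, chunk_size: int = 20000) -> Iterator[tuple[int, int]]:
    """Yield (start, stop) row ranges that cover n_cells in order.

    Hypothetical helper for illustration; descope's internal chunking
    may be implemented differently.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, n_cells, chunk_size):
        # The last chunk is shorter when n_cells is not a multiple of chunk_size.
        yield start, min(start + chunk_size, n_cells)


# 45,000 cells with the default chunk size: two full chunks and one partial chunk.
chunks = list(iter_chunks(45000, chunk_size=20000))
```

A smaller chunk_size lowers peak memory at the cost of more iterations.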

class TokenizerForRNA(cell_line_ft: str | AnnData, target_sum: float = 10000.0, pert_col: str = 'gene', gene_names_col: str | None = None, normalize_before_align_features: bool = False)

Bases: object

A tokenizer class specifically designed for RNA-seq (scRNA-seq) perturbation data to preprocess and tokenize datasets for pretraining.

Example:

>>> from descope.tokenizer import TokenizerForRNA
>>> from descope.utils import DuplicatedFeatureHandling

>>> tokenizer = TokenizerForRNA(
...     cell_line_ft="path/to/finetune_data.h5ad",
...     target_sum=1e4,
...     pert_col="gene",
...     gene_names_col="gene_symbols",
...     normalize_before_align_features=False,
... )

>>> tokenizer.tokenize(
...     cell_line_pt=["path/to/pretrain_data1.h5ad", "path/to/pretrain_data2.h5ad"],
...     cell_line_name=["cell_line_pretrain1", "cell_line_pretrain2"],
...     pert_col=["gene", "gene"],
...     gene_names_col=["gene_name", "gene_name"],
...     save_dir="./tokenized_dataset",
...     apply_pert_gene_filter=False,
...     duplicated_features_handling=DuplicatedFeatureHandling.mean_pooling,
...     skip_raw_counts_check=True,
...     chunk_size=20000
... )

tokenize(cell_line_pt: list[str | AnnData], cell_line_name: list[str], pert_col: list[str] | None = None, gene_names_col: list[str] | None = None, save_dir: str = './tokenized_datasets', apply_pert_gene_filter: bool = True, duplicated_features_handling: DuplicatedFeatureHandling = DuplicatedFeatureHandling.max_pooling, skip_raw_counts_check: bool = False, chunk_size: int = 20000)

Tokenizes multiple pretraining scRNA-seq datasets into Hugging Face Datasets format.

Parameters:

cell_line_pt (list[Union[str, sc.AnnData]]): A list of paths to .h5ad files or AnnData objects representing pretraining cell lines.
cell_line_name (list[str]): A list of names corresponding to each pretraining dataset; used as subdirectory names when saving tokenized data.
pert_col (list[str] or None, optional; default: None): A list of column names in the .obs attribute of each AnnData object indicating the perturbation labels. If None, the pert_col specified during tokenizer initialization is used for all datasets.
gene_names_col (list[str] or None, optional; default: None): A list of column names in the .var attribute of each AnnData object that contain gene symbols. Used to standardize gene names before alignment. If None, assumes var_names are already gene symbols.
save_dir (str, optional; default: "./tokenized_datasets"): Directory path where the tokenized datasets will be saved.
apply_pert_gene_filter (bool, optional; default: True): Whether to filter out cells with perturbations not present in the finetune dataset. If True, only cells with perturbations in self.pert_genes_list are retained.
duplicated_features_handling (DuplicatedFeatureHandling, optional; default: max_pooling): Strategy for handling duplicated gene names.
skip_raw_counts_check (bool, optional; default: False): Whether to skip the raw counts check during preprocessing. If you are sure that your data is raw counts, you can set this to True to skip the check.
chunk_size (int, optional; default: 20000): Number of cells to process per chunk during dataset construction. Helps reduce memory overhead when handling large AnnData objects.
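
To make the duplicated_features_handling option more concrete, the following sketch shows one plausible reading of max pooling versus mean pooling over duplicated gene names. The function pool_duplicated_genes and its string strategy argument are hypothetical stand-ins that mirror the DuplicatedFeatureHandling member names; this is not descope's implementation, and the library may pool across the full matrix rather than a single vector.

```python
from collections import defaultdict
from statistics import fmean


def pool_duplicated_genes(gene_names, values, strategy="max_pooling"):
    """Collapse entries that share a gene name into one value per gene.

    Hypothetical helper: the strategy names mirror DuplicatedFeatureHandling,
    but the actual pooling logic in descope is an assumption here.
    """
    if len(gene_names) != len(values):
        raise ValueError("gene_names and values must have the same length")
    grouped = defaultdict(list)
    for name, value in zip(gene_names, values):
        grouped[name].append(value)
    if strategy == "max_pooling":
        reduce = max
    elif strategy == "mean_pooling":
        reduce = fmean
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    # Insertion order is preserved, so genes keep their first-seen order.
    return {name: reduce(vals) for name, vals in grouped.items()}


genes = ["TP53", "MYC", "TP53"]
counts = [2.0, 5.0, 4.0]
max_pooled = pool_duplicated_genes(genes, counts, "max_pooling")
mean_pooled = pool_duplicated_genes(genes, counts, "mean_pooling")
```

Under these assumptions, the duplicated "TP53" entries collapse to 4.0 with max pooling and 3.0 with mean pooling, while "MYC" is unchanged.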

tokenize_adata_to_hf_dataset(adata: AnnData, cell_line_name: str | list[str] | None = None, cell_line_col: str | None = None, pert_col: str = 'gene', chunk_size: int = 20000) → Dataset

Convert an AnnData object into a Hugging Face Dataset for downstream modeling.

This function processes single-cell gene expression data stored in an AnnData object by extracting expression vectors and associated perturbation labels, then packages them into a Hugging Face Dataset. To manage memory usage for large datasets, the conversion is performed in chunks.

Parameters

adata (sc.AnnData): Single-cell dataset in AnnData format containing the gene expression matrix (in .X) and perturbation annotations (in .obs[pert_col]).
cell_line_name (str, list[str], or None, optional): Name of the cell line or cell type associated with the cells in the AnnData object. If a string, it is used as the celltype for all cells. If a list, its length must match the number of cells in adata. If None, cell_line_col must be provided.
cell_line_col (str, optional): Column name in adata.obs that contains the cell line name for each cell. If provided, each cell's celltype is read from this column. Mutually exclusive with cell_line_name.
pert_col (str, default: "gene"): Column name in adata.obs that contains the perturbation labels (e.g., gene names).
chunk_size (int, default: 20000): Number of cells to process per chunk during dataset construction. Helps reduce memory overhead when handling large AnnData objects.

Returns

ds (datasets.Dataset): A Hugging Face Dataset with the following columns:

- "labels": Gene expression vector for each cell (as a list of floats).
- "pert_gene": Perturbation label (e.g., gene name or "control").
- "celltype": Cell line name for each cell.

Notes

If adata.X is sparse (e.g., scipy sparse matrix), it is converted to a dense array using .toarray() before processing.
The resulting dataset is suitable for use with Hugging Face Transformers or other deep learning pipelines that expect dictionary-like batched inputs.
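
The column layout and the cell_line_name / cell_line_col rules described above can be sketched without any dependencies. Both helpers below are hypothetical illustrations, not descope functions: obs_cell_lines stands in for the values of adata.obs[cell_line_col], and expression_rows stands in for the dense rows of adata.X.

```python
def resolve_cell_lines(n_cells, cell_line_name=None, obs_cell_lines=None):
    """Return one cell line label per cell.

    Hypothetical helper: exactly one of the two sources must be given,
    mirroring the documented mutual exclusivity.
    """
    if (cell_line_name is None) == (obs_cell_lines is None):
        raise ValueError("provide exactly one of cell_line_name or cell_line_col")
    if obs_cell_lines is not None:
        labels = list(obs_cell_lines)
    elif isinstance(cell_line_name, str):
        # A single string is broadcast to every cell.
        labels = [cell_line_name] * n_cells
    else:
        labels = list(cell_line_name)
    if len(labels) != n_cells:
        raise ValueError("number of cell line labels must match the number of cells")
    return labels


def build_columns(expression_rows, pert_labels, cell_lines):
    """Assemble the three documented columns as a plain dict of lists."""
    return {
        "labels": [[float(v) for v in row] for row in expression_rows],
        "pert_gene": list(pert_labels),
        "celltype": list(cell_lines),
    }


rows = [[0, 1, 3], [2, 0, 0]]
columns = build_columns(
    rows,
    ["control", "TP53"],
    resolve_cell_lines(2, cell_line_name="K562"),
)
```

A dict of equal-length lists like this is the shape that Hugging Face's datasets.Dataset.from_dict accepts, although whether descope builds its dataset that way is not stated here.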

tokenize_adata_to_hf_dataset_for_atac(adata: str | AnnData, cell_line_name: str | list[str] | None = None, cell_line_col: str | None = None, perts_to_include: list[str] | None = None, perts_to_exclude: list[str] | None = None, topk_ccres: int = 50000, pert_col: str = 'perturbation', ctrl_name: str = 'control', save_dir: str = './tokenized_dataset', chunk_size: int = 20000) → Dataset

Preprocess and tokenize ATAC-seq perturbation data into a Hugging Face Dataset.

Parameters

adata (str or sc.AnnData): Input AnnData (or path to .h5ad) with ATAC profiles and perturbation labels.
cell_line_name (str, list[str], or None, optional): Cell line name for all cells, or a list of cell line names for each cell. If None, cell_line_col must be provided.
cell_line_col (str, optional): Column name in adata.obs containing cell line names for each cell. Mutually exclusive with cell_line_name.
perts_to_include / perts_to_exclude (list of str, optional): Mutually exclusive filters for perturbations. Only one may be specified. If both are None, all perturbations are retained. The control condition (ctrl_name) is always preserved regardless of the filter.
topk_ccres (int, default: 50000): Number of top variable cCREs to retain.
pert_col (str, default: "perturbation"): Column in adata.obs storing perturbation names.
ctrl_name (str, default: "control"): Label for control cells.
save_dir (str, default: "./tokenized_dataset"): Output directory for the saved dataset.
chunk_size (int, default: 20000): Number of cells to process per chunk during dataset construction. Helps reduce memory overhead when handling large AnnData objects.

Returns

datasets.Dataset: Tokenized dataset saved to save_dir, with columns: "labels", "pert_gene", "celltype".
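
The perts_to_include / perts_to_exclude semantics documented above (mutually exclusive, with control cells always kept) can be illustrated with a short sketch. The function select_cells is hypothetical and operates on a plain list of labels rather than adata.obs[pert_col]; it is an assumption about the filtering rule, not the library's code.

```python
def select_cells(pert_labels, perts_to_include=None, perts_to_exclude=None,
                 ctrl_name="control"):
    """Return the indices of cells that pass the perturbation filter.

    Hypothetical helper: control cells are kept regardless of the filter,
    and specifying both filters is treated as an error.
    """
    if perts_to_include is not None and perts_to_exclude is not None:
        raise ValueError("perts_to_include and perts_to_exclude are mutually exclusive")
    allowed = set(perts_to_include) if perts_to_include is not None else None
    blocked = set(perts_to_exclude) if perts_to_exclude is not None else set()
    kept = []
    for i, pert in enumerate(pert_labels):
        if pert == ctrl_name:
            kept.append(i)  # control is always preserved
        elif allowed is not None:
            if pert in allowed:
                kept.append(i)
        elif pert not in blocked:
            kept.append(i)
    return kept


labels = ["control", "TP53", "MYC", "control", "KRAS"]
included = select_cells(labels, perts_to_include=["TP53"])
excluded = select_cells(labels, perts_to_exclude=["control", "MYC"])
unfiltered = select_cells(labels)
```

Note that in this sketch listing the control label in perts_to_exclude has no effect, which matches the statement that the control condition is preserved regardless of the filter.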

tokenize_adata_to_hf_dataset_for_rna(adata: str | AnnData, cell_line_name: str | list[str] | None = None, cell_line_col: str | None = None, perts_to_include: list[str] | None = None, perts_to_exclude: list[str] | None = None, target_sum: float = 10000.0, pert_col: str = 'gene', ctrl_name: str = 'non-targeting', skip_raw_counts_check: bool = False, save_dir: str = './tokenized_dataset', chunk_size: int = 20000) → Dataset

Preprocess and tokenize scRNA-seq perturbation data into a Hugging Face Dataset.

Parameters

adata (str or sc.AnnData): Input AnnData (or path to .h5ad) with raw gene counts and perturbation labels.
cell_line_name (str, list[str], or None, optional): Cell line name for all cells, or a list of cell line names for each cell. If None, cell_line_col must be provided.
cell_line_col (str, optional): Column name in adata.obs containing cell line names for each cell. Mutually exclusive with cell_line_name.
perts_to_include / perts_to_exclude (list of str, optional): Mutually exclusive filters for perturbations. Only one may be specified. If both are None, all perturbations are retained. The control condition (ctrl_name) is always preserved regardless of the filter.
target_sum (float, default: 1e4): Total count per cell after normalization (CPM-like scaling).
pert_col (str, default: "gene"): Column in adata.obs containing perturbation identifiers.
ctrl_name (str, default: "non-targeting"): Label for control cells.
skip_raw_counts_check (bool, default: False): Skip the assertion that input counts are integers (use only if data is pre-validated).
save_dir (str, default: "./tokenized_dataset"): Directory to save the resulting dataset.
chunk_size (int, default: 20000): Number of cells to process per chunk during dataset construction. Helps reduce memory overhead when handling large AnnData objects.

Returns

datasets.Dataset: Tokenized dataset saved to save_dir, with columns: "labels", "pert_gene", "celltype".
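
As a rough illustration of the target_sum and skip_raw_counts_check parameters, the sketch below scales one cell's counts so they sum to target_sum and checks whether values look like raw integer counts. Both helper names are hypothetical. The exact check descope performs, and whether any further transform (such as a log transform) follows the scaling, are not stated in this reference and are not assumed here.

```python
def looks_like_raw_counts(row):
    """Return True if every value is a non-negative whole number."""
    return all(v >= 0 and float(v).is_integer() for v in row)


def normalize_total(row, target_sum=1e4):
    """Scale one cell's counts so that they sum to target_sum.

    Hypothetical helper: cells with zero total counts are returned as
    all zeros to avoid division by zero.
    """
    total = sum(row)
    if total == 0:
        return [0.0 for _ in row]
    scale = target_sum / total
    return [v * scale for v in row]


cell = [1, 3, 0, 4]
is_raw = looks_like_raw_counts(cell)
normalized = normalize_total(cell, target_sum=1e4)
```

With a total of 8 counts, each value is multiplied by 1250, so the normalized cell sums to 10000.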