descope.tokenizer
- class TokenizerForATAC(cell_line_ft: str | AnnData, topk_ccres: int = 50000, pert_col: str = 'perturbation')
Bases: object
A tokenizer class designed specifically for ATAC-seq data; it preprocesses and tokenizes datasets for pretraining.
Example:
>>> from descope.tokenizer import TokenizerForATAC
>>> tokenizer = TokenizerForATAC(
...     cell_line_ft="path/to/finetune_data.h5ad",
...     topk_ccres=50000,
...     pert_col="perturbation"
... )
>>> tokenizer.tokenize(
...     cell_line_pt=["path/to/pretrain_data1.h5ad", "path/to/pretrain_data2.h5ad"],
...     cell_line_name=["cell_line_pretrain1", "cell_line_pretrain2"],
...     pert_col=["perturbation1", "perturbation2"],
...     save_dir="./tokenized_dataset",
...     apply_pert_gene_filter=False,
...     chunk_size=20000
... )
- tokenize(cell_line_pt: list[str | AnnData], cell_line_name: list[str], pert_col: list[str] | None = None, save_dir: str = './tokenized_datasets', apply_pert_gene_filter: bool = True, chunk_size: int = 20000)
Tokenizes multiple pretraining ATAC-seq datasets into Hugging Face Datasets format.
Parameters:
- cell_line_pt : list[Union[str, sc.AnnData]]
  A list of paths to .h5ad files or AnnData objects representing pretraining cell lines.
- cell_line_name : list[str]
  A list of names corresponding to each pretraining dataset; used as subdirectory names when saving tokenized data.
- pert_col : list[str] or None, optional (default: None)
  A list of column names in the .obs attribute of each AnnData object indicating the perturbation labels. If None, the pert_col specified during tokenizer initialization is used for all datasets.
- save_dir : str, optional (default: "./tokenized_datasets")
  Directory path where the tokenized datasets will be saved.
- apply_pert_gene_filter : bool, optional (default: True)
  Whether to filter out cells with perturbations not present in the finetune dataset. If True, only cells with perturbations in self.pert_genes_list are retained.
- chunk_size : int, default=20000
  Number of cells to process per chunk during dataset construction. Helps reduce memory overhead when handling large AnnData objects.
- class TokenizerForRNA(cell_line_ft: str | AnnData, target_sum: float = 10000.0, pert_col: str = 'gene', gene_names_col: str | None = None, normalize_before_align_features: bool = False)
Bases: object
A tokenizer class designed specifically for scRNA-seq perturbation data; it preprocesses and tokenizes datasets for pretraining.
Example:
>>> from descope.tokenizer import TokenizerForRNA
>>> from descope.utils import DuplicatedFeatureHandling
>>> tokenizer = TokenizerForRNA(
...     cell_line_ft="path/to/finetune_data.h5ad",
...     target_sum=1e4,
...     pert_col="gene",
...     gene_names_col="gene_symbols",
...     normalize_before_align_features=False,
... )
>>> tokenizer.tokenize(
...     cell_line_pt=["path/to/pretrain_data1.h5ad", "path/to/pretrain_data2.h5ad"],
...     cell_line_name=["cell_line_pretrain1", "cell_line_pretrain2"],
...     pert_col=["gene", "gene"],
...     gene_names_col=["gene_name", "gene_name"],
...     save_dir="./tokenized_dataset",
...     apply_pert_gene_filter=False,
...     duplicated_features_handling=DuplicatedFeatureHandling.mean_pooling,
...     skip_raw_counts_check=True,
...     chunk_size=20000
... )
- tokenize(cell_line_pt: list[str | AnnData], cell_line_name: list[str], pert_col: list[str] | None = None, gene_names_col: list[str] | None = None, save_dir: str = './tokenized_datasets', apply_pert_gene_filter: bool = True, duplicated_features_handling: DuplicatedFeatureHandling = DuplicatedFeatureHandling.max_pooling, skip_raw_counts_check: bool = False, chunk_size: int = 20000)
Tokenizes multiple pretraining scRNA-seq datasets into Hugging Face Datasets format.
Parameters:
- cell_line_pt : list[Union[str, sc.AnnData]]
  A list of paths to .h5ad files or AnnData objects representing pretraining cell lines.
- cell_line_name : list[str]
  A list of names corresponding to each pretraining dataset; used as subdirectory names when saving tokenized data.
- pert_col : list[str] or None, optional (default: None)
  A list of column names in the .obs attribute of each AnnData object indicating the perturbation labels. If None, the pert_col specified during tokenizer initialization is used for all datasets.
- gene_names_col : list[str] or None, optional (default: None)
  A list of column names in the .var attribute of each AnnData object that contain gene symbols. Used to standardize gene names before alignment. If None, var_names are assumed to already be gene symbols.
- save_dir : str, optional (default: "./tokenized_datasets")
  Directory path where the tokenized datasets will be saved.
- apply_pert_gene_filter : bool, optional (default: True)
  Whether to filter out cells with perturbations not present in the finetune dataset. If True, only cells with perturbations in self.pert_genes_list are retained.
- duplicated_features_handling : DuplicatedFeatureHandling, optional (default: max_pooling)
  Strategy for handling duplicated gene names.
- skip_raw_counts_check : bool, optional (default: False)
  Whether to skip the raw counts check during preprocessing. If you are sure that your data is raw counts, you can set this to True to skip the check.
- chunk_size : int, default=20000
  Number of cells to process per chunk during dataset construction. Helps reduce memory overhead when handling large AnnData objects.
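The duplicated_features_handling strategies are only named here (max_pooling, mean_pooling), not described. As a rough illustration of what pooling over duplicated gene names could look like, the pure-Python sketch below collapses columns that share a name. The helper pool_duplicated_genes, its arguments, and the toy data are assumptions made for illustration; they are not part of descope, and the library's actual behavior may differ.

```python
from collections import defaultdict
from statistics import fmean


def pool_duplicated_genes(gene_names, matrix, how="max"):
    """Collapse columns that share a gene name (hypothetical helper, not descope API).

    gene_names: list of column names, possibly with repeats.
    matrix: list of rows (one per cell), each a list of expression values.
    how: "max" or "mean" pooling across the duplicated columns.
    """
    if how not in ("max", "mean"):
        raise ValueError(f"unknown pooling strategy: {how!r}")
    reduce_fn = max if how == "max" else fmean

    # Map each gene name to the column indices where it occurs (first-seen order).
    columns = defaultdict(list)
    for idx, name in enumerate(gene_names):
        columns[name].append(idx)

    unique_names = list(columns)
    pooled = [
        [reduce_fn([row[i] for i in columns[name]]) for name in unique_names]
        for row in matrix
    ]
    return unique_names, pooled


# Toy input: "TP53" appears twice and is collapsed into a single column.
names = ["TP53", "MYC", "TP53"]
cells = [[1.0, 2.0, 3.0], [0.0, 5.0, 4.0]]
max_names, max_pooled = pool_duplicated_genes(names, cells, how="max")
mean_names, mean_pooled = pool_duplicated_genes(names, cells, how="mean")
```

Max pooling keeps the strongest signal among duplicates, while mean pooling averages them; which one suits a dataset depends on why the duplicates exist.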
- tokenize_adata_to_hf_dataset(adata: AnnData, cell_line_name: str, pert_col: str = 'gene', chunk_size: int = 20000) -> Dataset
Convert an AnnData object into a Hugging Face Dataset for downstream modeling.
This function processes single-cell gene expression data stored in an AnnData object by extracting expression vectors and associated perturbation labels, then packages them into a Hugging Face Dataset. To manage memory usage for large datasets, the conversion is performed in chunks.
Parameters
- adata : sc.AnnData
  Single-cell dataset in AnnData format containing the gene expression matrix (in .X) and perturbation annotations (in .obs[pert_col]).
- cell_line_name : str
  Name of the cell line or cell type associated with all cells in the AnnData object. This will be added as a constant metadata field (“celltype”) in the output dataset.
- pert_col : str, default="gene"
  Column name in adata.obs that contains the perturbation labels (e.g., gene names).
- chunk_size : int, default=20000
  Number of cells to process per chunk during dataset construction. Helps reduce memory overhead when handling large AnnData objects.
Returns
- ds : datasets.Dataset
  A Hugging Face Dataset with the following columns:
  - “labels”: Gene expression vector for each cell (as a list of floats).
  - “pert_gene”: Perturbation label (e.g., gene name or “control”).
  - “celltype”: Constant string indicating the cell line name provided as input.
Notes
If adata.X is sparse (e.g., scipy sparse matrix), it is converted to a dense array using .toarray() before processing.
The resulting dataset is suitable for use with Hugging Face Transformers or other deep learning pipelines that expect dictionary-like batched inputs.
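The chunked conversion described above can be pictured with a small pure-Python sketch. It does not use AnnData or the datasets library: plain lists stand in for the expression matrix and the .obs column, and the generator name iter_record_chunks and the toy values are assumptions made for illustration, not descope internals.

```python
def iter_record_chunks(expression_rows, pert_labels, cell_line_name, chunk_size=20000):
    """Yield column-oriented dicts covering at most chunk_size cells each (illustrative only)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if len(expression_rows) != len(pert_labels):
        raise ValueError("one perturbation label is required per cell")

    for start in range(0, len(expression_rows), chunk_size):
        stop = start + chunk_size
        rows = expression_rows[start:stop]
        yield {
            # Expression vector per cell, stored as a list of floats.
            "labels": [[float(value) for value in row] for row in rows],
            # Perturbation label per cell.
            "pert_gene": list(pert_labels[start:stop]),
            # Constant cell line name repeated for every cell in the chunk.
            "celltype": [cell_line_name] * len(rows),
        }


# Toy data: five cells, two features, processed two cells at a time.
toy_expression = [[0, 1], [2, 0], [3, 4], [0, 0], [5, 1]]
toy_perts = ["control", "GATA1", "GATA1", "control", "MYC"]
chunks = list(iter_record_chunks(toy_expression, toy_perts, "K562", chunk_size=2))
```

In a real pipeline, each chunk dict could be turned into a dataset with datasets.Dataset.from_dict and the pieces joined with datasets.concatenate_datasets; whether descope does exactly this is not stated here.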
- tokenize_adata_to_hf_dataset_for_atac(adata: str | AnnData, cell_line_name: str, perts_to_include: list[str] | None = None, perts_to_exclude: list[str] | None = None, topk_ccres: int = 50000, pert_col: str = 'perturbation', ctrl_name: str = 'control', save_dir: str = './tokenized_dataset', chunk_size: int = 20000) -> Dataset
Preprocess and tokenize ATAC-seq perturbation data into a Hugging Face Dataset.
Parameters
- adata : str or sc.AnnData
  Input AnnData (or path to .h5ad) with ATAC profiles and perturbation labels.
- cell_line_name : str
  Cell line name; saved as “celltype” in the dataset.
- perts_to_include / perts_to_exclude : list of str, optional
  Mutually exclusive filters for perturbations. Only one may be specified. If both are None, all perturbations are retained. The control condition (ctrl_name) is always preserved regardless of the filter.
- topk_ccres : int, default=50000
  Number of top variable cCREs to retain.
- pert_col : str, default="perturbation"
  Column in adata.obs storing perturbation names.
- ctrl_name : str, default="control"
  Label for control cells.
- save_dir : str, default="./tokenized_dataset"
  Output directory for the saved dataset.
- chunk_size : int, default=20000
  Number of cells to process per chunk during dataset construction. Helps reduce memory overhead when handling large AnnData objects.
Returns
- datasets.Dataset
- Tokenized dataset saved to save_dir, with columns: “labels”, “pert_gene”, “celltype”.
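The perts_to_include / perts_to_exclude semantics described in the parameter list can be sketched in plain Python. The helper select_perturbations is hypothetical, and raising ValueError when both filters are given is an assumption; the documentation only says that one filter may be specified.

```python
def select_perturbations(pert_labels, perts_to_include=None, perts_to_exclude=None,
                         ctrl_name="control"):
    """Return the indices of cells to keep (hypothetical helper, not descope API)."""
    if perts_to_include is not None and perts_to_exclude is not None:
        # Assumed behavior: the docs only state that the filters are mutually exclusive.
        raise ValueError("specify only one of perts_to_include or perts_to_exclude")

    include = None if perts_to_include is None else set(perts_to_include)
    exclude = None if perts_to_exclude is None else set(perts_to_exclude)

    kept = []
    for idx, pert in enumerate(pert_labels):
        if pert == ctrl_name:
            kept.append(idx)  # control cells are always preserved
        elif include is not None:
            if pert in include:
                kept.append(idx)
        elif exclude is not None:
            if pert not in exclude:
                kept.append(idx)
        else:
            kept.append(idx)  # no filter given: keep everything
    return kept


labels = ["control", "GATA1", "MYC", "control", "TP53"]
included = select_perturbations(labels, perts_to_include=["GATA1"])
excluded = select_perturbations(labels, perts_to_exclude=["GATA1", "control"])
everything = select_perturbations(labels)
```

Note that listing the control label in perts_to_exclude has no effect in this sketch, mirroring the statement that the control condition is kept regardless of the filter.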
- tokenize_adata_to_hf_dataset_for_rna(adata: str | AnnData, cell_line_name: str, perts_to_include: list[str] | None = None, perts_to_exclude: list[str] | None = None, target_sum: float = 10000.0, pert_col: str = 'gene', ctrl_name: str = 'non-targeting', skip_raw_counts_check: bool = False, save_dir: str = './tokenized_dataset', chunk_size: int = 20000) -> Dataset
Preprocess and tokenize scRNA-seq perturbation data into a Hugging Face Dataset.
Parameters
- adata : str or sc.AnnData
  Input AnnData (or path to .h5ad) with raw gene counts and perturbation labels.
- cell_line_name : str
  Cell line name; stored as “celltype” in the output dataset.
- perts_to_include / perts_to_exclude : list of str, optional
  Mutually exclusive filters for perturbations. Only one may be specified. If both are None, all perturbations are retained. The control condition (ctrl_name) is always preserved regardless of the filter.
- target_sum : float, default=1e4
  Total count per cell after normalization (CPM-like scaling).
- pert_col : str, default="gene"
  Column in adata.obs containing perturbation identifiers.
- ctrl_name : str, default="non-targeting"
  Label for control cells.
- skip_raw_counts_check : bool, default=False
  Skip the assertion that input counts are integers (use only if the data is pre-validated).
- save_dir : str, default="./tokenized_dataset"
  Directory to save the resulting dataset.
- chunk_size : int, default=20000
  Number of cells to process per chunk during dataset construction. Helps reduce memory overhead when handling large AnnData objects.
Returns
- datasets.Dataset
- Tokenized dataset saved to save_dir, with columns: “labels”, “pert_gene”, “celltype”.
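To make target_sum and the raw counts check concrete, here is a minimal pure-Python sketch of both ideas. The helpers looks_like_raw_counts and normalize_total are hypothetical stand-ins written for illustration; they are not descope functions, and the sketch makes no claim about further steps (such as a log transform) that the library may or may not apply.

```python
def looks_like_raw_counts(rows):
    """Return True if every value is a non-negative whole number (hypothetical check)."""
    return all(value >= 0 and float(value).is_integer() for row in rows for value in row)


def normalize_total(rows, target_sum=1e4):
    """Scale each cell so its values sum to target_sum; all-zero cells stay all zero."""
    normalized = []
    for row in rows:
        total = sum(row)
        scale = target_sum / total if total > 0 else 0.0
        normalized.append([value * scale for value in row])
    return normalized


# Toy count matrix: three cells, two genes; the second cell has no counts.
raw = [[1, 3], [0, 0], [5, 5]]
is_raw = looks_like_raw_counts(raw)
scaled = normalize_total(raw, target_sum=1e4)
```

The scaling is similar in spirit to scanpy's sc.pp.normalize_total with the same target_sum argument. A check like this cannot prove that data is raw: normalized values that happen to be whole numbers would still pass, which is one reason an explicit skip_raw_counts_check flag is useful for pre-validated inputs.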