descope.utils
- class DuplicatedFeatureHandling(value)[source]
Bases:
EnumAn enumeration.
- max_pooling = 'max_pooling'
- mean_pooling = 'mean_pooling'
- class UniformFeatureForAnnData(input_h5ad: str | AnnData, feature_names_col: str | None = None, duplicated_features_handling: DuplicatedFeatureHandling = DuplicatedFeatureHandling.max_pooling)[source]
Bases:
objectAlign an AnnData object to a predefined set of feature names (e.g., genes, ccres etc.) by reindexing and zero-padding missing features.
This class ensures that the input AnnData is transformed to have exactly the same feature space as a provided target list. Features present in the target but missing in the input are filled with zeros; features not in the target are dropped. Duplicate feature names in the input are resolved via mean- or max-pooling before alignment.
Example:
Suppose you have an AnnData with 3 cells and genes [‘A’, ‘B’, ‘C’]:
>>> import scanpy as sc >>> import numpy as np >>> from descope.utils import UniformFeatureForAnnData, DuplicatedFeatureHandling
>>> X = np.array([[1, 2, 3], ... [4, 5, 6], ... [7, 8, 9]]) # shape: (3, 3) >>> adata = sc.AnnData(X) >>> adata.var_names = ['A', 'B', 'C']
>>> processor = UniformFeatureForAnnData(adata) >>> target_genes = ['B', 'C', 'D'] # Note: 'D' is not in original adata >>> aligned = processor(target_genes)
>>> print(aligned.X) [[2. 3. 0.] [5. 6. 0.] [8. 9. 0.]]
>>> print(aligned.var_names.tolist()) ['B', 'C', 'D']
Here, gene ‘A’ is dropped, ‘B’ and ‘C’ are kept in the new order, and ‘D’ (missing in input) is added as a zero column.
Parameters:
- input_h5adstr or AnnData
- Input data, either a path to an H5AD file or an in-memory AnnData object to be aligned.
- feature_names_colstr or None, optional (default: None)
- Column name in adata.var to use as feature identifiers (e.g., “gene_symbol”, “cCRE_id”).If None, the existing adata.var_names are used directly.
- duplicated_features_handlingDuplicatedFeatureHandling, optional (default: DuplicatedFeatureHandling.max_pooling)
- Strategy for resolving duplicated feature names:- DuplicatedFeatureHandling.mean_pooling: replace duplicates with their mean expression across cells.- DuplicatedFeatureHandling.max_pooling: retain the duplicate showing the highest average expression.
- save(save_path: str, convert_to_sparse: bool = False)[source]
Save the aligned AnnData (self.output_adata) to an H5AD file.
Parameters:
- save_pathstr
- File path where the output AnnData will be saved (in H5AD format).
- convert_to_sparsebool, optional (default: False)
- If True, converts the dense matrix X in output_adata to a sparse CSR matrix before saving.This can significantly reduce file size when the data is large and sparse-friendly.
- build_de(adata: str | AnnData, control_pert: str = 'non-targeting', pert_col: str = 'target_gene', de_method: str = 'wilcoxon', num_threads: int = 1, batch_size: int = 100, outdir: str | None = None, allow_discrete: bool = False, pdex_kwargs: dict[str, Any] | None = None, check_log_norm: bool = True) DataFrame[source]
Compute differential expression (DE) for all perturbations against a control group using scalable parallelized backend.
- check_adata_format_consistent_with_epiagent(adata: AnnData)[source]
Validate that the input AnnData conforms to the expected format for EpiAgent.
Specifically checks: 1. That the number of features (columns in X) equals the fixed cCRE count (1,355,445). 2. That the feature matrix X contains continuous (floating-point) values, as required after TF-IDF transformation.
Parameters:
- adataAnnData
- Input AnnData object to validate.
- direction_match_on_topk_de(data: PerturbationAnndataPair, de_real: str, topk: int = 100, separate_up_down_regulated: bool = False) tuple[dict[str, float], float][source]
- direction_match_on_topk_de(data: PerturbationAnndataPair, de_real: str, topk: int = 100, separate_up_down_regulated: bool = True) tuple[dict[str, float], dict[str, float], float, float]
Compute direction match accuracy (sign agreement) between predicted and real perturbation effects on top-k differentially expressed (DE) genes. Recommended to use in conjunction with cell-eval. (https://github.com/ArcInstitute/cell-eval)
For each perturbation: 1. Load its top-k DE genes from a precomputed CSV (de_real) ranked by FDR. 2. Optionally split them into up-regulated (percent_change > 0) and down-regulated (percent_change < 0) sets. 3. Compute perturbation effect as: Δ = bulk_perturbation - bulk_control for both real and predicted data. 4. Measure the fraction of genes where the sign of Δ_pred matches the sign of Δ_real.
Returns either: A single accuracy per perturbation (if separate_up_down_regulated=False), or Separate accuracies for up- and down-regulated genes (if True).
Parameters:
- dataPerturbationAnndataPair
- evaluator.anndata_pair, see the example for details.
- de_realstr
- DE CSV file generated by cell-eval from the real AnnData.
- topkint, optional (default: 100)
- Number of top DE genes (lowest FDR) to consider in each direction (or overall if not separated).
- separate_up_down_regulatedbool, optional (default: True)
- If True, compute direction accuracy separately for up- and down-regulated genes.If False, compute a single accuracy over the top-k DE genes regardless of regulation direction.
Returns:
- If separate_up_down_regulated=False:
- tuple[dict[str, float], float]
- - Per-perturbation direction match accuracy (fraction of genes with matching sign).- Mean accuracy across all perturbations.
- If separate_up_down_regulated=True:
- tuple[dict[str, float], dict[str, float], float, float]
- - Per-perturbation accuracy on up-regulated genes.- Per-perturbation accuracy on down-regulated genes.- Mean accuracy on up-regulated genes.- Mean accuracy on down-regulated genes.
Example:
>>> from cell_eval import MetricsEvaluator
>>> # Initialize evaluator >>> evaluator = MetricsEvaluator(...)
>>> # Using separate up/down evaluation (default) >>> up_acc, down_acc, mean_up, mean_down = direction_match_on_topk_de( ... data=evaluator.anndata_pair, ... de_real="de_results.csv", # pre-computed DE results from cell-eval ... topk=100, ... separate_up_down_regulated=True ... ) >>> print(f"Up-regulated direction accuracy: {mean_up:.4f}") >>> print(f"Down-regulated direction accuracy: {mean_down:.4f}")
>>> # Or using combined evaluation >>> acc, mean_acc = direction_match_on_topk_de( ... data=evaluator.anndata_pair, ... de_real="de_results.csv", ... topk=100, ... separate_up_down_regulated=False ... ) >>> print(f"Overall direction accuracy: {mean_acc:.4f}")
- edistance(adata: str | AnnData, control_pert: str = 'non-targeting', pert_col: str = 'target_gene', metric: str = 'euclidean', **kwargs) dict[str, float][source]
Compute energy distance (E-distance) between each perturbation and the control group.
The E-distance is defined as: 2 * E[||X - Y||] - E[||X - X’||] - E[||Y - Y’||], where X is the perturbation, Y is the control, and X’, Y’ are independent copies. This implementation reuses the precomputed within-control distance (sigma_y) for efficiency.
Parameters:
- adatastr or AnnData
- Path to H5AD file or in-memory AnnData object.
- control_pertstr, optional (default: “non-targeting”)
- Value in adata.obs[pert_col] that identifies the control group.
- pert_colstr, optional (default: “target_gene”)
- Column name in adata.obs indicating the perturbation condition.
- metricstr, optional (default: “euclidean”)
- Distance metric passed to sklearn.metrics.pairwise_distances.Supported values include “euclidean”, “cosine”, etc.
- **kwargsdict
- Additional keyword arguments passed to pairwise_distances.
Returns:
- dict[str, float]
- Mapping from perturbation name (str) to its E-distance from the control group (float).
- intersect_adatas_for_celltype_transfer(adata_source: str | AnnData, adata_target: str | AnnData, pert_col: str = 'gene', ctrl_name: str = 'non-targeting', feature_names_col: str | None = None, duplicated_features_handling: DuplicatedFeatureHandling = DuplicatedFeatureHandling.max_pooling, gene_embs_file: str | None = None, save_dir: str = './intersection-outdir')[source]
Intersect two AnnData objects (source and target) on both perturbations and features to prepare them for cell type transfer.
Parameters:
- adata_sourcestr or AnnData
- Path to H5AD file or in-memory AnnData object representing the source adata.
- adata_targetstr or AnnData
- Path to H5AD file or in-memory AnnData object representing the target adata.
- pert_colstr, optional (default: “gene”)
- Column name in adata.obs that indicates the perturbation or condition label.
- ctrl_namestr, optional (default: “non-targeting”)
- Name used to denote control (unperturbed) cells. Must be present in both adatas.
- feature_names_colstr or None, optional (default: None)
- Column name in adata.var to use as feature identifiers (e.g., “gene_symbol”).If None, uses adata.var_names directly.
- duplicated_features_handlingDuplicatedFeatureHandling, optional (default: DuplicatedFeatureHandling.max_pooling)
- Strategy to resolve duplicate feature names in either adata before alignment.
- gene_embs_filestr or None, optional (default: None)
- Optional path to a PyTorch file containing gene embeddings (as a dict: {gene: embedding}).If provided, only perturbations present in this file (plus ctrl_name) are retained.
- save_dirstr, optional (default: “./intersection-outdir”)
- Directory where the processed AnnData files and metadata CSV will be saved.
Outputs:
- The function saves the following to save_dir:
- - source.h5ad: filtered and feature-aligned source AnnData.- target.h5ad: filtered and feature-aligned target AnnData.- target_info.csv: a table listing each perturbation in the target and its cell count.
- load_gene_embs(gene_embs_file: str, perts_to_emb: list | None = None) dict[str, Tensor][source]
Load gene embeddings from a PyTorch file and ensure coverage for a given set of perturbations. Handles single-gene and multi-gene perturbations. Missing embeddings are replaced by zero vectors for absent genes, and multi-gene perturbations are computed as the mean of all constituent genes (including zeros for missing genes).
Parameters:
- gene_embs_filestr
- Path to a .pt or .pth file containing a dictionary {gene_name: embedding_tensor}.
- perts_to_emblist[str] or None, optional (default: None)
- List of perturbation (gene) names that must be present in the output embedding dictionary.Multi-gene perturbations must be denoted with ‘+’ (e.g., “GeneA+GeneB”).If None, simply returns all embeddings from the file without modification.
Returns:
- dict[str, torch.Tensor]
- A dictionary mapping gene names to their embedding tensors.Guaranteed to include all entries in perts_to_emb (if not None).
- load_gene_names_engine(fp: str | None = None) list[str] | None[source]
Load a list of gene names from a plain-text or CSV file.
Supports two common formats: 1. Single-column file (TXT/CSV): each line contains one gene name. 2. Multi-column CSV with header: must include a column named “target_gene”; other columns (e.g., “n_cells”) are ignored.
Parameters:
- fpstr or None, optional (default: None)
- Path to the input file. If None, returns None.
Returns:
- list[str] or None
- A deduplicated list of gene names as strings. Returns None if fp is None.
- pearson(data: PerturbationAnndataPair) tuple[dict[str, float], float][source]
Compute Pearson correlation between predicted and real bulk expression profiles for each perturbation. Recommended to use in conjunction with cell-eval. (https://github.com/ArcInstitute/cell-eval)
For each perturbation: 1. Compute the bulk (mean) expression profile across all cells under that perturbation, separately for the predicted (pert_pred) and real (pert_real) adata. 2. Calculate the Pearson correlation coefficient between these two bulk vectors.
Returns per-perturbation correlations and their average.
Parameters:
- dataPerturbationAnndataPair
- evaluator.anndata_pair, see the example for details.
Returns:
- tuple[dict[str, float], float]
- - First element: a dictionary mapping perturbation names (str) to Pearson correlation coefficients (float).- Second element: the mean of all per-perturbation correlations.
Example:
>>> from cell_eval import MetricsEvaluator
>>> # Initialize evaluator >>> evaluator = MetricsEvaluator(...)
>>> pert_corr, mean_corr = pearson(data=evaluator.anndata_pair) >>> print(f"Mean Pearson correlation across perturbations: {mean_corr:.4f}")
- pearson_delta_on_topk_de(data: PerturbationAnndataPair, de_real: str, topk: int = 20) tuple[dict[str, float], float][source]
Compute Pearson correlation between predicted and real perturbation effects on the top-k differentially expressed (DE) genes. Recommended to use in conjunction with cell-eval. (https://github.com/ArcInstitute/cell-eval)
For each perturbation: 1. Load its top-k DE genes from a precomputed CSV (de_real) ranked by FDR. 2. Compute the bulk profile for the perturbation (mean expression across its cells) and for the control group. 3. Derive the perturbation effect as: Δ = bulk_perturbation - bulk_control, separately for real and predicted data. 4. Compute Pearson correlation between the real and predicted Δ vectors over the top-k DE genes. Returns per-perturbation correlations and their average.
Parameters:
- dataPerturbationAnndataPair
- evaluator.anndata_pair, see the example for details.
- de_realstr
- DE CSV file generated by cell-eval from the real AnnData.
- topkint, optional (default: 20)
- Number of top DE genes (lowest FDR) to use for correlation per perturbation.
Returns:
- tuple[dict[str, float], float]
- - First element: a dictionary mapping perturbation names (str) to Pearson correlation coefficients (float).- Second element: the mean of all per-perturbation correlations.
Example:
>>> from cell_eval import MetricsEvaluator
>>> # Initialize evaluator >>> evaluator = MetricsEvaluator(...)
>>> pearson_delta_on_topk_de, pearson_delta_on_topk_de_mean = pearson_delta_on_topk_de( ... data=evaluator.anndata_pair, ... de_real="de_results.csv", # pre-computed DE results from cell-eval ... topk=20 ... )
>>> print(f"Mean Pearson Delta on top-20 DE genes: {pearson_delta_on_topk_de_mean:.4f}")
- preprocess_atac_perturbation_adata_consistent_with_epiagent(adata: AnnData, topk_ccres: int = 50000, pert_col: str = 'perturbation') AnnData[source]
Preprocess an ATAC-seq perturbation AnnData to be compatible with EpiAgent.
This function performs three steps: 1. Validates that the input matches EpiAgent’s expected cCRE dimensionality and data type. 2. Selects the top topk_ccres cCREs by total signal across cells. 3. Removes cells with missing (NaN) values in the perturbation column.
Parameters:
- adataAnnData
- Input ATAC-seq AnnData object, expected to contain ~1.36M cCREs and TF-IDF-transformed continuous values.Ensure that the AnnData object meets the requirements of EpiAgent.
- topk_ccresint, optional (default: 50000)
- Number of most active cCREs to retain.
- pert_colstr, optional (default: “perturbation”)
- Column name in adata.obs indicating the perturbation condition.
Returns:
- AnnData
- Preprocessed AnnData ready for use with EpiAgent: filtered to top cCREs and cleaned of NaN perturbations.
- preprocess_rna_perturbation_adata(adata: AnnData, target_sum: float = 10000.0, pert_col: str = 'target_gene', skip_raw_counts_check: bool = False) AnnData[source]
Preprocess an RNA-seq perturbation AnnData for downstream analysis.
This function: 1. Removes cells with missing perturbation labels in obs[pert_col]. 2. Applies library-size normalization (normalize_total) followed by log1p, unless the data appears already log-normalized (detected heuristically).
Parameters:
- adataAnnData
- Input RNA-seq AnnData object containing raw or normalized counts.
- target_sumfloat, optional (default: 1e4)
- Target total count per cell for library-size normalization.
- pert_colstr, optional (default: “target_gene”)
- Column name in adata.obs indicating the perturbed gene or condition.
- skip_raw_counts_checkbool, optional (default: False)
- If True, skips the heuristic check for log-normalization and always applieslibrary-size normalization (sc.pp.normalize_total) followed by log1p.Use this only when you are certain the input contains raw, unnormalized counts.If False (default), the function checks whether the data appears to be alreadylog-normalized (e.g., via log1p of normalized counts). If so, normalization andlog-transformation are skipped; otherwise, they are applied.
Returns:
- AnnData
- Cleaned and normalized AnnData, with NaN-labeled cells removed and expression log-transformed.
- select_topk_ccres(adata: AnnData, topk_ccres: int = 50000) AnnData[source]
Select the top-k cCREs (peaks) with highest total signal across all cells.
The selection is based on the column-wise sum of adata.X. A stable sort is used to ensure reproducibility when ties occur in peak sums.
Parameters:
- adataAnnData
- Input AnnData object containing cCRE-level features in X.
- topk_ccresint, optional (default: 50000)
- Number of top cCREs to retain.
Returns:
- AnnData
- A subset of the input adata containing only the top topk_ccres cCREs, in descending order of total signal.
NOTE: This selection procedure is identical to that used in EpiAgent, with the addition of a stable sort to ensure that the top-k cCREs are always deterministically selected.