descope.utils

class DuplicatedFeatureHandling(value)[source]

Bases: Enum

An enumeration.

max_pooling = 'max_pooling'

mean_pooling = 'mean_pooling'

class UniformFeatureForAnnData(input_h5ad: str | AnnData, feature_names_col: str | None = None, duplicated_features_handling: DuplicatedFeatureHandling = DuplicatedFeatureHandling.max_pooling)[source]

Bases: object

Align an AnnData object to a predefined set of feature names (e.g., genes, ccres etc.) by reindexing and zero-padding missing features.

This class ensures that the input AnnData is transformed to have exactly the same feature space as a provided target list. Features present in the target but missing in the input are filled with zeros; features not in the target are dropped. Duplicate feature names in the input are resolved via mean- or max-pooling before alignment.

Example:

Suppose you have an AnnData with 3 cells and genes [‘A’, ‘B’, ‘C’]:

>>> import scanpy as sc
>>> import numpy as np
>>> from descope.utils import UniformFeatureForAnnData, DuplicatedFeatureHandling

>>> X = np.array([[1, 2, 3],
...               [4, 5, 6],
...               [7, 8, 9]])  # shape: (3, 3)
>>> adata = sc.AnnData(X)
>>> adata.var_names = ['A', 'B', 'C']

>>> processor = UniformFeatureForAnnData(adata)
>>> target_genes = ['B', 'C', 'D']  # Note: 'D' is not in original adata
>>> aligned = processor(target_genes)

>>> print(aligned.X)
[[2. 3. 0.]
 [5. 6. 0.]
 [8. 9. 0.]]

>>> print(aligned.var_names.tolist())
['B', 'C', 'D']

Here, gene ‘A’ is dropped, ‘B’ and ‘C’ are kept in the new order, and ‘D’ (missing in input) is added as a zero column.

Parameters:

input_h5adstr or AnnData: Input data, either a path to an H5AD file or an in-memory AnnData object to be aligned.
feature_names_colstr or None, optional (default: None): Column name in adata.var to use as feature identifiers (e.g., “gene_symbol”, “cCRE_id”).

If None, the existing adata.var_names are used directly.
duplicated_features_handlingDuplicatedFeatureHandling, optional (default: DuplicatedFeatureHandling.max_pooling): Strategy for resolving duplicated feature names:

- DuplicatedFeatureHandling.mean_pooling: replace duplicates with their mean expression across cells.

- DuplicatedFeatureHandling.max_pooling: retain the duplicate showing the highest average expression.

save(save_path: str, convert_to_sparse: bool = False)[source]

Save the aligned AnnData (self.output_adata) to an H5AD file.

Parameters:

save_pathstr: File path where the output AnnData will be saved (in H5AD format).
convert_to_sparsebool, optional (default: False): If True, converts the dense matrix X in output_adata to a sparse CSR matrix before saving.

This can significantly reduce file size when the data is large and sparse-friendly.

build_de(adata: str | AnnData, control_pert: str = 'non-targeting', pert_col: str = 'target_gene', de_method: str = 'wilcoxon', num_threads: int = 1, batch_size: int = 100, outdir: str | None = None, allow_discrete: bool = False, pdex_kwargs: dict[str, Any] | None = None, check_log_norm: bool = True) → DataFrame[source]: Compute differential expression (DE) for all perturbations against a control group using scalable parallelized backend.

check_adata_format_consistent_with_epiagent(adata: AnnData)[source]

Validate that the input AnnData conforms to the expected format for EpiAgent.

Specifically checks: 1. That the number of features (columns in X) equals the fixed cCRE count (1,355,445). 2. That the feature matrix X contains continuous (floating-point) values, as required after TF-IDF transformation.

Parameters:

adataAnnData: Input AnnData object to validate.

direction_match_on_topk_de(data: PerturbationAnndataPair, de_real: str, topk: int = 100, separate_up_down_regulated: bool = False) → tuple[dict[str, float], float][source]

direction_match_on_topk_de(data: PerturbationAnndataPair, de_real: str, topk: int = 100, separate_up_down_regulated: bool = True) → tuple[dict[str, float], dict[str, float], float, float]

Compute direction match accuracy (sign agreement) between predicted and real perturbation effects on top-k differentially expressed (DE) genes. Recommended to use in conjunction with cell-eval. (https://github.com/ArcInstitute/cell-eval)

For each perturbation: 1. Load its top-k DE genes from a precomputed CSV (de_real) ranked by FDR. 2. Optionally split them into up-regulated (percent_change > 0) and down-regulated (percent_change < 0) sets. 3. Compute perturbation effect as: Δ = bulk_perturbation - bulk_control for both real and predicted data. 4. Measure the fraction of genes where the sign of Δ_pred matches the sign of Δ_real.

Returns either: A single accuracy per perturbation (if separate_up_down_regulated=False), or Separate accuracies for up- and down-regulated genes (if True).

Parameters:

dataPerturbationAnndataPair: evaluator.anndata_pair, see the example for details.
de_realstr: DE CSV file generated by cell-eval from the real AnnData.
topkint, optional (default: 100): Number of top DE genes (lowest FDR) to consider in each direction (or overall if not separated).
separate_up_down_regulatedbool, optional (default: True): If True, compute direction accuracy separately for up- and down-regulated genes.

If False, compute a single accuracy over the top-k DE genes regardless of regulation direction.

Returns:

If separate_up_down_regulated=False:

tuple[dict[str, float], float]: - Per-perturbation direction match accuracy (fraction of genes with matching sign).

- Mean accuracy across all perturbations.

If separate_up_down_regulated=True:

tuple[dict[str, float], dict[str, float], float, float]: - Per-perturbation accuracy on up-regulated genes.

- Per-perturbation accuracy on down-regulated genes.

- Mean accuracy on up-regulated genes.

- Mean accuracy on down-regulated genes.

Example:

>>> from cell_eval import MetricsEvaluator

>>> # Initialize evaluator
>>> evaluator = MetricsEvaluator(...)

>>> # Using separate up/down evaluation (default)
>>> up_acc, down_acc, mean_up, mean_down = direction_match_on_topk_de(
...     data=evaluator.anndata_pair,
...     de_real="de_results.csv",  # pre-computed DE results from cell-eval
...     topk=100,
...     separate_up_down_regulated=True
... )
>>> print(f"Up-regulated direction accuracy: {mean_up:.4f}")
>>> print(f"Down-regulated direction accuracy: {mean_down:.4f}")

>>> # Or using combined evaluation
>>> acc, mean_acc = direction_match_on_topk_de(
...     data=evaluator.anndata_pair,
...     de_real="de_results.csv",
...     topk=100,
...     separate_up_down_regulated=False
... )
>>> print(f"Overall direction accuracy: {mean_acc:.4f}")

edistance(adata: str | AnnData, control_pert: str = 'non-targeting', pert_col: str = 'target_gene', metric: str = 'euclidean', **kwargs) → dict[str, float][source]

Compute energy distance (E-distance) between each perturbation and the control group.

The E-distance is defined as: 2 * E[||X - Y||] - E[||X - X’||] - E[||Y - Y’||], where X is the perturbation, Y is the control, and X’, Y’ are independent copies. This implementation reuses the precomputed within-control distance (sigma_y) for efficiency.

Parameters:

adatastr or AnnData: Path to H5AD file or in-memory AnnData object.
control_pertstr, optional (default: “non-targeting”): Value in adata.obs[pert_col] that identifies the control group.
pert_colstr, optional (default: “target_gene”): Column name in adata.obs indicating the perturbation condition.
metricstr, optional (default: “euclidean”): Distance metric passed to sklearn.metrics.pairwise_distances.

Supported values include “euclidean”, “cosine”, etc.
**kwargsdict: Additional keyword arguments passed to pairwise_distances.

Returns:

dict[str, float]: Mapping from perturbation name (str) to its E-distance from the control group (float).

intersect_adatas_for_celltype_transfer(adata_source: str | AnnData, adata_target: str | AnnData, pert_col: str = 'gene', ctrl_name: str = 'non-targeting', feature_names_col: str | None = None, duplicated_features_handling: DuplicatedFeatureHandling = DuplicatedFeatureHandling.max_pooling, gene_embs_file: str | None = None, save_dir: str = './intersection-outdir')[source]

Intersect two AnnData objects (source and target) on both perturbations and features to prepare them for cell type transfer.

Parameters:

adata_sourcestr or AnnData: Path to H5AD file or in-memory AnnData object representing the source adata.
adata_targetstr or AnnData: Path to H5AD file or in-memory AnnData object representing the target adata.
pert_colstr, optional (default: “gene”): Column name in adata.obs that indicates the perturbation or condition label.
ctrl_namestr, optional (default: “non-targeting”): Name used to denote control (unperturbed) cells. Must be present in both adatas.
feature_names_colstr or None, optional (default: None): Column name in adata.var to use as feature identifiers (e.g., “gene_symbol”).

If None, uses adata.var_names directly.
duplicated_features_handlingDuplicatedFeatureHandling, optional (default: DuplicatedFeatureHandling.max_pooling): Strategy to resolve duplicate feature names in either adata before alignment.
gene_embs_filestr or None, optional (default: None): Optional path to a PyTorch file containing gene embeddings (as a dict: {gene: embedding}).

If provided, only perturbations present in this file (plus ctrl_name) are retained.
save_dirstr, optional (default: “./intersection-outdir”): Directory where the processed AnnData files and metadata CSV will be saved.

Outputs:

The function saves the following to save_dir:: - source.h5ad: filtered and feature-aligned source AnnData.

- target.h5ad: filtered and feature-aligned target AnnData.

- target_info.csv: a table listing each perturbation in the target and its cell count.

load_gene_embs(gene_embs_file: str, perts_to_emb: list | None = None) → dict[str, Tensor][source]

Load gene embeddings from a PyTorch file and ensure coverage for a given set of perturbations. Handles single-gene and multi-gene perturbations. Missing embeddings are replaced by zero vectors for absent genes, and multi-gene perturbations are computed as the mean of all constituent genes (including zeros for missing genes).

Parameters:

gene_embs_filestr: Path to a .pt or .pth file containing a dictionary {gene_name: embedding_tensor}.
perts_to_emblist[str] or None, optional (default: None): List of perturbation (gene) names that must be present in the output embedding dictionary.

Multi-gene perturbations must be denoted with ‘+’ (e.g., “GeneA+GeneB”).

If None, simply returns all embeddings from the file without modification.

Returns:

dict[str, torch.Tensor]: A dictionary mapping gene names to their embedding tensors.

Guaranteed to include all entries in perts_to_emb (if not None).

load_gene_names_engine(fp: str | None = None) → list[str] | None[source]

Load a list of gene names from a plain-text or CSV file.

Supports two common formats: 1. Single-column file (TXT/CSV): each line contains one gene name. 2. Multi-column CSV with header: must include a column named “target_gene”; other columns (e.g., “n_cells”) are ignored.

Parameters:

fpstr or None, optional (default: None): Path to the input file. If None, returns None.

Returns:

list[str] or None: A deduplicated list of gene names as strings. Returns None if fp is None.

pearson(data: PerturbationAnndataPair) → tuple[dict[str, float], float][source]

Compute Pearson correlation between predicted and real bulk expression profiles for each perturbation. Recommended to use in conjunction with cell-eval. (https://github.com/ArcInstitute/cell-eval)

For each perturbation: 1. Compute the bulk (mean) expression profile across all cells under that perturbation, separately for the predicted (pert_pred) and real (pert_real) adata. 2. Calculate the Pearson correlation coefficient between these two bulk vectors.

Returns per-perturbation correlations and their average.

Parameters:

dataPerturbationAnndataPair: evaluator.anndata_pair, see the example for details.

Returns:

tuple[dict[str, float], float]: - First element: a dictionary mapping perturbation names (str) to Pearson correlation coefficients (float).

- Second element: the mean of all per-perturbation correlations.

Example:

>>> from cell_eval import MetricsEvaluator

>>> # Initialize evaluator
>>> evaluator = MetricsEvaluator(...)

>>> pert_corr, mean_corr = pearson(data=evaluator.anndata_pair)
>>> print(f"Mean Pearson correlation across perturbations: {mean_corr:.4f}")

pearson_delta_on_topk_de(data: PerturbationAnndataPair, de_real: str, topk: int = 20) → tuple[dict[str, float], float][source]

Compute Pearson correlation between predicted and real perturbation effects on the top-k differentially expressed (DE) genes. Recommended to use in conjunction with cell-eval. (https://github.com/ArcInstitute/cell-eval)

For each perturbation: 1. Load its top-k DE genes from a precomputed CSV (de_real) ranked by FDR. 2. Compute the bulk profile for the perturbation (mean expression across its cells) and for the control group. 3. Derive the perturbation effect as: Δ = bulk_perturbation - bulk_control, separately for real and predicted data. 4. Compute Pearson correlation between the real and predicted Δ vectors over the top-k DE genes. Returns per-perturbation correlations and their average.

Parameters:

dataPerturbationAnndataPair: evaluator.anndata_pair, see the example for details.
de_realstr: DE CSV file generated by cell-eval from the real AnnData.
topkint, optional (default: 20): Number of top DE genes (lowest FDR) to use for correlation per perturbation.

Returns:

tuple[dict[str, float], float]: - First element: a dictionary mapping perturbation names (str) to Pearson correlation coefficients (float).

- Second element: the mean of all per-perturbation correlations.

Example:

>>> from cell_eval import MetricsEvaluator

>>> # Initialize evaluator
>>> evaluator = MetricsEvaluator(...)

>>> pearson_delta_on_topk_de, pearson_delta_on_topk_de_mean = pearson_delta_on_topk_de(
...     data=evaluator.anndata_pair,
...     de_real="de_results.csv",  # pre-computed DE results from cell-eval
...     topk=20
... )

>>> print(f"Mean Pearson Delta on top-20 DE genes: {pearson_delta_on_topk_de_mean:.4f}")

preprocess_atac_perturbation_adata_consistent_with_epiagent(adata: AnnData, topk_ccres: int = 50000, pert_col: str = 'perturbation') → AnnData[source]

Preprocess an ATAC-seq perturbation AnnData to be compatible with EpiAgent.

This function performs three steps: 1. Validates that the input matches EpiAgent’s expected cCRE dimensionality and data type. 2. Selects the top topk_ccres cCREs by total signal across cells. 3. Removes cells with missing (NaN) values in the perturbation column.

Parameters:

adataAnnData: Input ATAC-seq AnnData object, expected to contain ~1.36M cCREs and TF-IDF-transformed continuous values.

Ensure that the AnnData object meets the requirements of EpiAgent.
topk_ccresint, optional (default: 50000): Number of most active cCREs to retain.
pert_colstr, optional (default: “perturbation”): Column name in adata.obs indicating the perturbation condition.

Returns:

AnnData: Preprocessed AnnData ready for use with EpiAgent: filtered to top cCREs and cleaned of NaN perturbations.

preprocess_rna_perturbation_adata(adata: AnnData, target_sum: float = 10000.0, pert_col: str = 'target_gene', skip_raw_counts_check: bool = False) → AnnData[source]

Preprocess an RNA-seq perturbation AnnData for downstream analysis.

This function: 1. Removes cells with missing perturbation labels in obs[pert_col]. 2. Applies library-size normalization (normalize_total) followed by log1p, unless the data appears already log-normalized (detected heuristically).

Parameters:

adataAnnData: Input RNA-seq AnnData object containing raw or normalized counts.
target_sumfloat, optional (default: 1e4): Target total count per cell for library-size normalization.
pert_colstr, optional (default: “target_gene”): Column name in adata.obs indicating the perturbed gene or condition.
skip_raw_counts_checkbool, optional (default: False): If True, skips the heuristic check for log-normalization and always applies

library-size normalization (sc.pp.normalize_total) followed by log1p.

Use this only when you are certain the input contains raw, unnormalized counts.

If False (default), the function checks whether the data appears to be already

log-normalized (e.g., via log1p of normalized counts). If so, normalization and

log-transformation are skipped; otherwise, they are applied.

Returns:

AnnData: Cleaned and normalized AnnData, with NaN-labeled cells removed and expression log-transformed.

select_topk_ccres(adata: AnnData, topk_ccres: int = 50000) → AnnData[source]

Select the top-k cCREs (peaks) with highest total signal across all cells.

The selection is based on the column-wise sum of adata.X. A stable sort is used to ensure reproducibility when ties occur in peak sums.

Parameters:

adataAnnData: Input AnnData object containing cCRE-level features in X.
topk_ccresint, optional (default: 50000): Number of top cCREs to retain.

Returns:

AnnData: A subset of the input adata containing only the top topk_ccres cCREs, in descending order of total signal.

NOTE: This selection procedure is identical to that used in EpiAgent, with the addition of a stable sort to ensure that the top-k cCREs are always deterministically selected.