mtopic.pp.filter_var_knee

mtopic.pp.filter_var_knee(path, model, knee_sensitivity=5)

Filter overrepresented features from a MuData object using a knee detection algorithm.

This function identifies and removes overrepresented features (e.g., genes, proteins) in each modality of a MuData object. For each modality, it computes a cumulative feature score across all topics and applies a knee detection algorithm to that score. Features beyond the knee point (a significant drop-off in the score) are treated as overrepresented and filtered out to improve data quality and downstream analysis.

Parameters:
  • path (str) – The file path to the .h5mu file containing the MuData object to be processed.

  • model (mtopic.tl.MTM or mtopic.tl.sMTM) – An instance of a topic model containing the topic-feature distributions (e.g., lambda_ matrix for each modality).

  • knee_sensitivity (int or dict, optional) – Sensitivity for the knee detection algorithm. Higher values make the algorithm more conservative in identifying overrepresented features. It can be a single integer (global for all modalities) or a dictionary specifying sensitivity per modality. Default is 5.

Returns:

A MuData object with overrepresented features removed.

Return type:

muon.MuData

Raises:
  • FileNotFoundError – If the specified .h5mu file does not exist or is inaccessible.

  • ValueError – If knee_sensitivity is invalid or features cannot be identified for filtering.

Example:
import mtopic

# Load MuData object and model
mdata = mtopic.read.h5mu("path/to/file.h5mu")
model = mtopic.tl.MTM(mdata, n_topics=20)

# Filter overrepresented features
filtered_mdata = mtopic.pp.filter_var_knee("path/to/file.h5mu", model)

Notes:
  • Feature Identification: Overrepresented features are identified by calculating their cumulative feature score across all topics in a modality. The knee point of this score is located with the kneed package, and features beyond it are considered overrepresented.

  • Knee Sensitivity: The knee_sensitivity parameter can be set globally for all modalities or specified individually for each modality as a dictionary. This allows flexibility based on the characteristics of each modality.

  • Data Consistency: After filtering, the mdata.update() method ensures consistency across the multimodal data structure.

  • Applicability: This approach is suited to filtering features that dominate topic distributions and may therefore obscure meaningful patterns.
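
The feature identification step described in the notes can be illustrated with a minimal, dependency-free sketch. It is not the mtopic implementation: the helper names (cumulative_scores, knee_index, filter_overrepresented) are hypothetical, the knee is found with a simple farthest-from-chord heuristic rather than the Kneedle algorithm used by kneed, and knee_sensitivity is not modelled. It also assumes that "beyond the knee" means the features ranked above the knee when scores are sorted in descending order.

```python
def cumulative_scores(topic_feature):
    """Sum each feature's weight over all topics (rows = topics, columns = features)."""
    return [sum(column) for column in zip(*topic_feature)]


def knee_index(sorted_scores):
    """Return the index of the point farthest below the chord of the score curve.

    sorted_scores must be in descending order. Both axes are rescaled to [0, 1],
    so the chord runs from (0, 1) to (1, 0) and the gap below it is 1 - x - y.
    This is a stand-in heuristic, not the Kneedle algorithm from kneed.
    """
    n = len(sorted_scores)
    if n < 3:
        raise ValueError("need at least three features to locate a knee")
    hi, lo = sorted_scores[0], sorted_scores[-1]
    if hi == lo:
        raise ValueError("all scores are equal; no knee to detect")
    best_i, best_gap = 0, -1.0
    for i, score in enumerate(sorted_scores):
        x = i / (n - 1)
        y = (score - lo) / (hi - lo)
        gap = 1.0 - x - y
        if gap > best_gap:
            best_i, best_gap = i, gap
    return best_i


def filter_overrepresented(features, topic_feature):
    """Split features into (kept, dropped) using the knee of their cumulative scores."""
    scores = cumulative_scores(topic_feature)
    # Rank features from highest to lowest cumulative score.
    order = sorted(range(len(features)), key=lambda j: scores[j], reverse=True)
    knee = knee_index([scores[j] for j in order])
    # Assumption: features ranked above the knee are the overrepresented ones.
    dropped = {features[j] for j in order[:knee]}
    kept = [f for f in features if f not in dropped]
    return kept, sorted(dropped)


# Toy topic-feature matrix for one modality: 2 topics x 6 features.
features = ["g1", "g2", "g3", "g4", "g5", "g6"]
topic_feature = [
    [50, 1, 2, 40, 1, 2],
    [50, 2, 1, 40, 2, 1],
]
kept, dropped = filter_overrepresented(features, topic_feature)
print(kept)     # ['g2', 'g3', 'g5', 'g6']
print(dropped)  # ['g1', 'g4']
```

In this toy example, g1 and g4 carry almost all of the weight in both topics, so they sit above the knee and are dropped, while the remaining features keep their original order.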