Evaluator Module
The evaluator module provides metrics and orchestration for assessing counterfactual quality.
The metrics implemented follow established counterfactual evaluation literature, with core metrics based on Wachter et al. (2017) and Mothilal et al. (2020).
Evaluator Class
- class tscf_eval.Evaluator[source]
Bases: object
Run a collection of Metric instances over example pairs.
The Evaluator orchestrates the computation of multiple metrics over pairs of original instances and their counterfactuals. It handles progress reporting, error handling, and result aggregation.
- Parameters:
metrics (iterable of Metric) – Collection of metric instances to compute during evaluation.
Examples
>>> from tscf_eval.evaluator import Evaluator, Validity, Proximity, Sparsity
>>> import numpy as np
>>>
>>> # Create evaluator with multiple metrics
>>> evaluator = Evaluator([Validity(), Proximity(p=2), Sparsity()])
>>>
>>> # Evaluate counterfactuals
>>> X = np.random.randn(100, 50)  # 100 instances, 50 time points
>>> X_cf = X + np.random.randn(100, 50) * 0.1
>>> results = evaluator.evaluate(X, X_cf, y=np.zeros(100), y_cf=np.ones(100))
>>>
>>> print(results['validity'], results['proximity_l2'], results['sparsity'])
- evaluate(X, X_cf, **kwargs)[source]
Compute all configured metrics and return a mapping name -> result.
The evaluator forwards all provided kwargs to each metric’s compute method. To avoid silent behavior, if the caller provides time_per_instance then an Efficiency-style metric must be present (i.e., a metric whose name() returns "efficiency_time_s") that will consume that argument and report a canonical value. This avoids the evaluator guessing at how to aggregate timings.
- Parameters:
X (np.ndarray) – Original instances, shape (M, ...).
X_cf (np.ndarray) – Counterfactual instances, shape matching X.
**kwargs – Forwarded to each metric. Common kwargs include:
  - model: Classifier for metrics like Validity, Controllability.
  - X_train: Training data for Plausibility, Robustness.
  - y, y_cf: Labels for Validity when model is not provided.
  - time_per_instance: Timings for the Efficiency metric.
- Returns:
dict – Mapping from metric name to computed result. Also includes '_evaluator_time_s' with total evaluation time.
- Raises:
ValueError – If X and X_cf have different numbers of instances, or if time_per_instance is provided without an Efficiency metric.
TypeError – If a metric raises TypeError due to unexpected kwargs.
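The forwarding and guard behavior described for evaluate can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the library's implementation: run_metrics and MeanAbsChange are hypothetical names, and progress reporting and per-metric error handling are omitted. Only the "efficiency_time_s" name, the '_evaluator_time_s' key, and the two ValueError conditions come from the description above.

```python
import time


class MeanAbsChange:
    """Hypothetical toy metric: mean absolute change per value."""

    def name(self):
        return "mean_abs_change"

    def compute(self, X, X_cf, **kwargs):
        diffs = [abs(a - b) for x, x_cf in zip(X, X_cf) for a, b in zip(x, x_cf)]
        return sum(diffs) / len(diffs)


def run_metrics(metrics, X, X_cf, **kwargs):
    """Run every metric on (X, X_cf) and collect results by metric name."""
    if len(X) != len(X_cf):
        raise ValueError("X and X_cf must have the same number of instances")
    names = [m.name() for m in metrics]
    # Timings are only accepted when a metric exists to consume them.
    if "time_per_instance" in kwargs and "efficiency_time_s" not in names:
        raise ValueError("time_per_instance requires an Efficiency metric")
    start = time.perf_counter()
    # All kwargs are forwarded unchanged to every metric.
    results = {m.name(): m.compute(X, X_cf, **kwargs) for m in metrics}
    results["_evaluator_time_s"] = time.perf_counter() - start
    return results


X = [[0.0, 1.0], [2.0, 3.0]]
X_cf = [[0.0, 2.0], [2.0, 3.0]]
results = run_metrics([MeanAbsChange()], X, X_cf)
```

Rejecting time_per_instance up front, rather than dropping it, keeps a caller from believing timings were recorded when no metric reported them.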
Metric Base Class
- class tscf_eval.evaluator.base.Metric[source]
Bases: ABC
Abstract base for a single evaluation metric.
Subclasses must implement name and compute.
The compute method receives two arrays, X and X_cf (original instances and counterfactuals), and may accept additional keyword arguments such as model or X_train depending on the metric’s needs.
- abstractmethod compute(X, X_cf, **kwargs)[source]
Compute the metric.
- Parameters:
X (np.ndarray) – Original instances, shape (M, ...).
X_cf (np.ndarray) – Counterfactual instances, shape matching X.
**kwargs – Optional metric-specific keyword arguments (e.g., model, X_train, y, y_cf).
- Returns:
Any – Metric result (scalar, array, or mapping).
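A custom metric only needs the two members named above. The sketch below defines a local stand-in for the base class so that it runs without the package installed; the stand-in Metric and the MaxAbsChange metric are hypothetical and are not part of tscf_eval.

```python
from abc import ABC, abstractmethod


class Metric(ABC):
    """Local stand-in for the documented base class (assumed interface)."""

    @abstractmethod
    def name(self):
        ...

    @abstractmethod
    def compute(self, X, X_cf, **kwargs):
        ...


class MaxAbsChange(Metric):
    """Hypothetical metric: largest absolute change, averaged over instances."""

    def name(self):
        return "max_abs_change"

    def compute(self, X, X_cf, **kwargs):
        # Extra kwargs such as model or X_train are accepted and ignored.
        per_instance = [
            max(abs(a - b) for a, b in zip(x, x_cf)) for x, x_cf in zip(X, X_cf)
        ]
        return sum(per_instance) / len(per_instance)


metric = MaxAbsChange()
score = metric.compute([[0.0, 1.0], [2.0, 2.0]], [[0.5, 1.0], [2.0, 4.0]], model=None)
```

Accepting **kwargs in compute matters because the evaluator forwards every keyword argument to every metric, including ones a given metric does not use.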
Built-in Metrics
Validity
Fraction of counterfactuals that change the model prediction. Based on Li et al. (2023).
- class tscf_eval.Validity[source]
Bases: Metric
Fraction of counterfactuals that change the model prediction.
Accepts either a fitted model (with predict) or two label arrays y and y_cf. When model is provided, compares model predictions on X and X_cf; otherwise compares the provided label arrays.
- Parameters:
mode ({"hard", "soft"}, default "soft") – Evaluation mode.
  - "hard": Binary indicator: fraction of instances where the predicted label changed.
  - "soft": Mean probability shift toward the target class. Computed as P(target_class | X_cf) - P(target_class | X) per instance, clipped to [0, 1]. Requires a model with predict_proba. Falls back to hard validity when only label arrays are provided.
See Li et al. (2023) for details.
- __init__(mode='soft')[source]
Initialize the Validity metric.
- Parameters:
mode ({"hard", "soft"}, default "soft") – Evaluation mode. "hard" uses binary label change; "soft" uses probability shift toward the target class.
- compute(X, X_cf, model=None, y=None, y_cf=None, **kwargs)[source]
Compute validity score.
- Parameters:
X (np.ndarray) – Original instances.
X_cf (np.ndarray) – Counterfactual instances.
model (object, optional) – Classifier with predict method (and predict_proba for soft mode).
y (array-like, optional) – Original labels (used if model is None).
y_cf (array-like, optional) – Counterfactual labels (used if model is None).
**kwargs – Additional keyword arguments. Recognized internal keys:
  - _cached_y_pred, _cached_y_cf_pred: Pre-computed hard predictions from the Evaluator.
  - _cached_proba_X, _cached_proba_X_cf: Pre-computed class probabilities from the Evaluator (used in soft mode).
- Returns:
float – For hard mode: fraction of instances where the label changed, in [0, 1]. For soft mode: mean probability shift toward the target class, in [0, 1].
- Raises:
ValueError – If neither model nor (y, y_cf) are provided, or if soft mode is requested but the model lacks predict_proba.
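The two modes can be illustrated on plain lists. This is a sketch of the formulas as described, not the library code: hard_validity and soft_validity are hypothetical helpers, and the target class is passed in explicitly because the text does not say how the metric determines it.

```python
def hard_validity(y, y_cf):
    """Fraction of instances whose label changed."""
    return sum(a != b for a, b in zip(y, y_cf)) / len(y)


def soft_validity(proba_x, proba_cf, target):
    """Mean gain in target-class probability, clipped to [0, 1] per instance."""
    shifts = []
    for p_x, p_cf, t in zip(proba_x, proba_cf, target):
        shift = p_cf[t] - p_x[t]
        shifts.append(min(1.0, max(0.0, shift)))
    return sum(shifts) / len(shifts)


y = [0, 0, 1, 1]
y_cf = [1, 0, 0, 1]
hard = hard_validity(y, y_cf)  # 2 of 4 labels changed

# Rows are per-instance class probabilities; target class is 1 for both rows.
proba_x = [[0.75, 0.25], [0.5, 0.5]]
proba_cf = [[0.25, 0.75], [0.75, 0.25]]
soft = soft_validity(proba_x, proba_cf, target=[1, 1])
```

The second instance moves away from the target class, so its negative shift is clipped to 0 instead of cancelling the gain of the first instance.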
Proximity
Proximity score between original and counterfactual instances, computed as
1 / (1 + d) where d is the per-instance distance (L-p norm or DTW). Higher values indicate
counterfactuals closer to the originals.
Based on Delaney et al. (2021) and Bahri et al. (2022).
- class tscf_eval.Proximity[source]
Bases: Metric
Proximity score between original and counterfactual instances.
Computed as 1 / (1 + d) where d is the per-instance distance. Values are in [0, 1], where 1 means identical and higher is better.
- Parameters:
p (int or float, default 2) – Norm order (1 for L1, 2 for L2, np.inf or float('inf') for Linf). Only used when distance="lp".
distance ({"lp", "dtw"}, default "dtw") – Distance function to use.
  - "lp": L-p norm distance (controlled by p).
  - "dtw": Dynamic Time Warping distance (per-channel, averaged). Requires tslearn; falls back to Euclidean if unavailable.
See Delaney et al. (2021) and Bahri et al. (2022) for details.
- compute(X, X_cf, **kwargs)[source]
Compute mean proximity score across instances.
The score is 1 / (1 + d) where d is the distance, averaged over all instances.
- Parameters:
X (np.ndarray) – Original instances.
X_cf (np.ndarray) – Counterfactual instances.
**kwargs – Ignored.
- Returns:
float – Mean proximity score in [0, 1]. Higher values indicate counterfactuals closer to the originals.
- Raises:
ValueError – If X and X_cf have different numbers of instances, or if distance is not a supported value.
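For the "lp" case, the score can be worked through by hand. The helpers below are hypothetical and operate on already flattened instances; the DTW variant is not shown.

```python
import math


def lp_distance(x, x_cf, p=2):
    """L-p distance between two flat sequences (p may be math.inf)."""
    diffs = [abs(a - b) for a, b in zip(x, x_cf)]
    if math.isinf(p):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)


def proximity(X, X_cf, p=2):
    """Mean of 1 / (1 + d) over instances."""
    if len(X) != len(X_cf):
        raise ValueError("X and X_cf must have the same number of instances")
    scores = [1.0 / (1.0 + lp_distance(x, x_cf, p)) for x, x_cf in zip(X, X_cf)]
    return sum(scores) / len(scores)


X = [[0.0, 0.0], [1.0, 1.0]]
X_cf = [[3.0, 4.0], [1.0, 1.0]]
# First pair: d = 5, score 1/6. Second pair: d = 0, score 1.
score = proximity(X, X_cf, p=2)
```

Because the transform is applied per instance before averaging, one distant counterfactual lowers the mean but cannot push it below zero.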
Sparsity
Fraction of features/time-points changed between original and counterfactual. Lower values indicate sparser (more targeted) edits. Based on Mothilal et al. (2020).
- class tscf_eval.Sparsity[source]
Bases: Metric
Fraction of features/time-points changed between original and counterfactual.
Flattens per-instance arrays and reports the mean fraction of entries that differ between X and X_cf. Lower values indicate sparser (more targeted) edits.
Notes
A feature is considered unchanged if |X[i] - X_cf[i]| <= atol + rtol * |X[i]|. This avoids false positives from floating-point precision issues.
See Mothilal et al. (2020) for details.
- compute(X, X_cf, **kwargs)[source]
Compute mean sparsity across instances.
- Parameters:
X (np.ndarray) – Original instances.
X_cf (np.ndarray) – Counterfactual instances.
**kwargs – Ignored.
- Returns:
float – Mean fraction of changed features in [0, 1]. Lower values indicate sparser (more targeted) edits.
- Raises:
ValueError – If X and X_cf have different numbers of instances.
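The tolerance rule from the Notes can be sketched directly. The default values atol=1e-8 and rtol=1e-5 below are an assumption borrowed from numpy.isclose, not values confirmed for this class; changed and sparsity are hypothetical helpers working on flattened instances.

```python
def changed(a, b, atol=1e-8, rtol=1e-5):
    """True when b differs from a beyond atol + rtol * |a|."""
    return abs(a - b) > atol + rtol * abs(a)


def sparsity(X, X_cf, atol=1e-8, rtol=1e-5):
    """Mean per-instance fraction of changed entries."""
    if len(X) != len(X_cf):
        raise ValueError("X and X_cf must have the same number of instances")
    fractions = [
        sum(changed(a, b, atol, rtol) for a, b in zip(x, x_cf)) / len(x)
        for x, x_cf in zip(X, X_cf)
    ]
    return sum(fractions) / len(fractions)


X = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 0.0, 0.0]]
# The 1e-9 offset on the last value of the first row is within tolerance.
X_cf = [[1.0, 2.5, 3.0, 4.0 + 1e-9], [0.0, 1.0, 1.0, 0.0]]
score = sparsity(X, X_cf)
```

Here the first instance counts one changed entry out of four and the second counts two out of four.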
Plausibility
Whether counterfactuals lie within the training data distribution, scored via outlier detection. Supports four backends: LOF (Breunig et al., 2000), Isolation Forest (Liu et al., 2008), Matrix Profile + OneClassSVM (Yeh et al., 2016), and LOF with DTW distance.
- class tscf_eval.Plausibility[source]
Bases: Metric
Plausibility scored via an outlier detector.
Evaluates whether counterfactuals lie within the training data distribution using outlier detection methods.
- Parameters:
method ({'lof', 'if', 'mp_ocsvm', 'dtw_lof'}, default 'dtw_lof') – Detector backend:
  - 'lof': LocalOutlierFactor in novelty mode (Breunig et al., 2000).
  - 'if': IsolationForest (Liu et al., 2008).
  - 'mp_ocsvm': Matrix Profile features (Yeh et al., 2016) with OneClassSVM.
  - 'dtw_lof': LOF with DTW distance (precomputed distance matrix). Uses tslearn for DTW; falls back to Euclidean if unavailable. More appropriate for time series as it respects temporal alignment.
**kwargs – Additional arguments passed to the detector.
Notes
When optional packages (e.g., stumpy) are unavailable, the implementation falls back to safe alternatives.
- __init__(method='dtw_lof', **kwargs)[source]
Initialize the Plausibility metric.
- Parameters:
method ({"lof", "if", "mp_ocsvm", "dtw_lof"}, default "dtw_lof") – Outlier detection backend to use.
**kwargs – Additional arguments passed to the underlying detector.
- clear_cache()[source]
Clear cached fitted detectors and matrix profile features to free memory.
- compute(X, X_cf, X_train=None, **kwargs)[source]
Compute plausibility score.
- Parameters:
X (np.ndarray) – Original instances.
X_cf (np.ndarray) – Counterfactual instances.
X_train (np.ndarray, optional) – Training data for fitting the detector. If None, uses X.
**kwargs – Ignored.
- Returns:
float – Fraction of counterfactuals classified as inliers, in [0, 1].
Diversity
Diversity among multiple counterfactuals for the same query, using a DPP-inspired log-determinant measure. Based on Mothilal et al. (2020) and Kulesza & Taskar (2012).
- class tscf_eval.Diversity[source]
Bases: Metric
Diversity of multiple counterfactuals using DPP-inspired log-determinant.
Measures diversity among multiple counterfactuals for the same query. Higher values indicate more diverse counterfactuals.
- Parameters:
distance ({"euclidean", "dtw"}, default "dtw") – Distance function used to build the pairwise distance matrix between counterfactuals for each query.
  - "euclidean": Euclidean distance on flattened vectors.
  - "dtw": Per-channel DTW distance (averaged across channels). Requires tslearn; falls back to Euclidean if unavailable.
Notes
Expects X_cf with shape (M, K, ...) where K is the number of counterfactuals per query.
See Mothilal et al. (2020) and Kulesza & Taskar (2012) for details.
- __init__(distance='dtw')[source]
Initialize the Diversity metric.
- Parameters:
distance ({"euclidean", "dtw"}, default "dtw") – Distance function for building pairwise distance matrices.
- compute(X, X_cf, max_components=50, **kwargs)[source]
Compute diversity score.
- Parameters:
X (np.ndarray) – Original instances.
X_cf (np.ndarray) – Counterfactual instances of shape (M, K, ...) where K is the number of counterfactuals per query.
max_components (int, default 50) – Maximum number of components for randomized SVD approximation.
**kwargs – May contain _X_cf_all, the full set of counterfactuals, when the benchmark passes only the first counterfactual per query as X_cf for the other metrics.
- Returns:
float – Diversity score (higher = more diverse). Returns np.nan if X_cf has fewer than 3 dimensions (single CF per query).
- Raises:
ValueError – If distance is not a supported value.
Controllability
Ease of reverting counterfactual changes via partial edits that restore random subsets of the changed features. Based on Verma et al. (2024).
- class tscf_eval.Controllability[source]
Bases: Metric
How easily a counterfactual can be reverted by partial controlled edits.
For each counterfactual, this metric reverts random subsets of changed features at several fraction levels and checks whether the original prediction is restored. The score is the fraction of revert attempts that succeed, averaged across fractions, samples, and instances.
- Parameters:
revert_fractions (list of float, optional) – Fractions of changed features to revert at each probe level. Default is [0.1, 0.2, 0.3, 0.4, 0.5].
n_samples (int, optional) – Number of random subsets to draw per fraction per instance. Default is 10.
random_state (int or None, optional) – Seed for reproducibility. Default is None.
See Verma et al. (2024) for details.
- __init__(revert_fractions=None, n_samples=10, random_state=None)[source]
Initialize the Controllability metric.
- Parameters:
revert_fractions (list of float, optional) – Fractions of changed features to revert at each probe level. Default is [0.1, 0.2, 0.3, 0.4, 0.5].
n_samples (int, default 10) – Number of random subsets to draw per fraction per instance.
random_state (int or None, default None) – Seed for reproducibility.
- compute(X, X_cf, model=None, **kwargs)[source]
Compute controllability score via random subset reverts.
For each instance the method identifies which features changed, then for every fraction in revert_fractions it draws n_samples random subsets of that size from the changed features, reverts them to their original values, and checks whether the model prediction is restored.
- Parameters:
X (np.ndarray) – Original instances of shape (M, ...).
X_cf (np.ndarray) – Counterfactual instances of shape (M, ...).
model (object) – Classifier with a predict method.
**kwargs – Additional keyword arguments (unused).
- Returns:
float – Mean controllability score in [0, 1]. Higher values indicate that counterfactuals can be more easily reverted.
- Raises:
ValueError – If model is None.
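The probing loop can be sketched with a toy classifier. Several details here are assumptions rather than documented behavior: subset sizes are rounded with a minimum of one feature, instances with no changed features are skipped, exact inequality marks a feature as changed, and the model is a plain function on one flattened instance (predict is a hypothetical stand-in, not a scikit-learn style predict).

```python
import random


def controllability(X, X_cf, predict, revert_fractions=(0.1, 0.2, 0.3, 0.4, 0.5),
                    n_samples=10, seed=0):
    """Mean success rate of restoring the original label by partial reverts."""
    rng = random.Random(seed)
    instance_scores = []
    for x, x_cf in zip(X, X_cf):
        changed = [i for i, (a, b) in enumerate(zip(x, x_cf)) if a != b]
        if not changed:
            continue  # nothing to revert (assumed handling)
        original_label = predict(x)
        successes, attempts = 0, 0
        for frac in revert_fractions:
            # Assumed rounding rule: at least one, at most all changed features.
            size = min(len(changed), max(1, round(frac * len(changed))))
            for _ in range(n_samples):
                probe = list(x_cf)
                for i in rng.sample(changed, size):
                    probe[i] = x[i]  # revert this feature to its original value
                successes += predict(probe) == original_label
                attempts += 1
        instance_scores.append(successes / attempts)
    if not instance_scores:
        return 0.0
    return sum(instance_scores) / len(instance_scores)


def predict(x):
    """Toy classifier: class 1 when the values sum to a positive number."""
    return int(sum(x) > 0)


easy = controllability([[0.0, 0.0]], [[1.0, 0.0]], predict)
hard = controllability([[-1.0] * 4], [[5.0] * 4], predict)
```

In the first case a single revert always restores the original class; in the second, reverting up to half of the changed features never does.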
Confidence
Model confidence (maximum predicted probability) statistics for original and counterfactual instances. Based on Le et al. (2023).
- class tscf_eval.Confidence[source]
Bases: Metric
Confidence summaries (maximum predicted probability) for instances.
Reports the mean maximum predicted probability for both original and counterfactual instances, as well as the mean difference.
See Le et al. (2023) for details.
- compute(X, X_cf, model=None, **kwargs)[source]
Compute confidence statistics.
- Parameters:
X (np.ndarray) – Original instances of shape (M, ...).
X_cf (np.ndarray) – Counterfactual instances of shape (M, ...).
model (object) – Classifier with a predict_proba method.
**kwargs – Additional keyword arguments (unused).
- Returns:
dict – Dictionary with keys:
  - mean_conf_orig: Mean max probability for original instances.
  - mean_conf_cf: Mean max probability for counterfactuals.
  - mean_conf_delta: Mean difference (cf - orig).
- Raises:
ValueError – If model is None.
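Given class probabilities for both sets of instances, the three summary values reduce to a few averages. The helper confidence_summary is hypothetical and takes the probabilities directly, where the metric itself would obtain them from the model's predict_proba.

```python
def confidence_summary(proba_x, proba_cf):
    """Mean max-probability for originals, counterfactuals, and their difference."""
    conf_orig = [max(p) for p in proba_x]
    conf_cf = [max(p) for p in proba_cf]
    n = len(conf_orig)
    return {
        "mean_conf_orig": sum(conf_orig) / n,
        "mean_conf_cf": sum(conf_cf) / n,
        "mean_conf_delta": sum(c - o for c, o in zip(conf_cf, conf_orig)) / n,
    }


# Rows are per-instance class probabilities.
proba_x = [[0.75, 0.25], [0.5, 0.5]]
proba_cf = [[0.25, 0.75], [0.0, 1.0]]
summary = confidence_summary(proba_x, proba_cf)
```

A positive mean_conf_delta means the model is, on average, more confident on the counterfactuals than on the originals.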
Composition
Segment-based statistics measuring contiguous runs of edits, relevant for time series interpretability. Based on Delaney et al. (2021) and Ates et al. (2021).
- class tscf_eval.Composition[source]
Bases: Metric
Segment-based composition statistics measuring runs of edits.
Analyzes the structure of edits by counting contiguous segments (runs) of changed values and their lengths. In addition to the mean number of segments and the mean segment length, it reports a composition score:
composition = mean_segment_length / mean_n_segments.
This favors counterfactuals with fewer, longer contiguous edits over those with many short, scattered modifications.
See Delaney et al. (2021) and Ates et al. (2021) for details.
- compute(X, X_cf, **kwargs)[source]
Compute composition statistics.
- Parameters:
X (np.ndarray) – Original instances of shape (M, ...).
X_cf (np.ndarray) – Counterfactual instances of shape (M, ...).
**kwargs – Additional keyword arguments (unused).
- Returns:
dict – Dictionary with keys:
  - mean_n_segments: Mean number of contiguous edit segments.
  - mean_avg_segment_len: Mean average length of segments.
  - composition: Ratio mean_avg_segment_len / mean_n_segments.
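Run counting and the composition ratio can be sketched for univariate series. The helpers are hypothetical; treating a value as changed on exact inequality and returning 0.0 when there are no edits are assumptions, not documented behavior.

```python
def edit_runs(x, x_cf):
    """Lengths of contiguous runs of changed positions in one series."""
    runs, current = [], 0
    for a, b in zip(x, x_cf):
        if a != b:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def composition_stats(X, X_cf):
    """Mean segment count, mean average segment length, and their ratio."""
    n_segments, avg_lengths = [], []
    for x, x_cf in zip(X, X_cf):
        runs = edit_runs(x, x_cf)
        n_segments.append(len(runs))
        avg_lengths.append(sum(runs) / len(runs) if runs else 0.0)
    mean_n = sum(n_segments) / len(n_segments)
    mean_len = sum(avg_lengths) / len(avg_lengths)
    return {
        "mean_n_segments": mean_n,
        "mean_avg_segment_len": mean_len,
        "composition": mean_len / mean_n if mean_n else 0.0,
    }


X = [[0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]]
# First series: one block of three edits. Second: three isolated edits.
X_cf = [[1, 1, 1, 0, 0, 0], [1, 0, 1, 0, 1, 0]]
stats = composition_stats(X, X_cf)
```

Both series change three values, but taken alone the first has ratio 3 / 1 and the second 1 / 3, which is the preference for fewer, longer edits described above.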
Contiguity
Scalar measure of how concentrated edits are in a single block. Based on Delaney et al. (2021) and Ates et al. (2021).
- class tscf_eval.Contiguity[source]
Bases: Metric
Measure how contiguous edits are (fewer runs = higher contiguity).
Produces a scalar in [0, 1] where 1 indicates fully contiguous edits (all changes occur in a single uninterrupted segment).
See Delaney et al. (2021) and Ates et al. (2021) for details.
- compute(X, X_cf, **kwargs)[source]
Compute contiguity score.
- Parameters:
X (np.ndarray) – Original instances of shape (M, ...).
X_cf (np.ndarray) – Counterfactual instances of shape (M, ...).
**kwargs – Additional keyword arguments (unused).
- Returns:
float – Mean contiguity score in [0, 1]. Higher values indicate more contiguous edits.
Robustness
Local Lipschitz-like stability estimate using k-nearest neighbor analysis. Based on Ates et al. (2021).
- class tscf_eval.Robustness[source]
Bases: Metric
Local Lipschitz-like robustness estimate using k-nearest neighbors.
This metric estimates how sensitive counterfactuals are relative to the original inputs by scanning local neighbor pairs. For each instance i and its k nearest neighbors j (excluding i) it computes the ratio d(x_cf[i], x_cf[j]) / d(x[i], x[j]). Larger values indicate that small perturbations in the original space can produce larger changes in the counterfactual space, i.e., lower local robustness.
- Parameters:
k (int, optional) – Number of neighbors to consider. Default is 5. If the dataset has fewer instances than k + 1, the number of neighbors is reduced accordingly.
distance ({"euclidean", "dtw"}, default "dtw") – Distance function to use.
  - "euclidean": Euclidean distance on flattened vectors.
  - "dtw": Per-channel DTW distance (averaged across channels). Requires tslearn; falls back to Euclidean if unavailable.
See Ates et al. (2021) for details.
- compute(X, X_cf, X_train=None, **kwargs)[source]
Compute robustness score.
- Parameters:
X (np.ndarray) – Original instances of shape (M, ...).
X_cf (np.ndarray) – Counterfactual instances of shape (M, ...).
X_train (np.ndarray, optional) – Training data (unused, present for API compatibility).
**kwargs – Additional keyword arguments (unused).
- Returns:
float – 95th-percentile neighbor ratio (>= 0). Returns 0.0 when there are not enough instances to form neighbor pairs.
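The neighbor-ratio computation can be sketched with Euclidean distance on flattened instances. The helpers are hypothetical, and three details are assumptions: neighbors are searched in the original space, pairs of identical originals are skipped to avoid division by zero, and the percentile uses linear interpolation.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def neighbor_ratios(X, X_cf, k=5):
    """Ratios d(x_cf[i], x_cf[j]) / d(x[i], x[j]) for each i and its k nearest j."""
    n = len(X)
    k = min(k, n - 1)  # fewer than k + 1 instances: shrink the neighborhood
    ratios = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: euclidean(X[i], X[j]))
        for j in others[:k]:
            d_x = euclidean(X[i], X[j])
            if d_x > 0:  # skip duplicate originals (assumed handling)
                ratios.append(euclidean(X_cf[i], X_cf[j]) / d_x)
    return ratios


def percentile(values, q):
    """Linear-interpolation percentile, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def robustness(X, X_cf, k=5, q=95):
    ratios = neighbor_ratios(X, X_cf, k)
    return percentile(ratios, q) if ratios else 0.0


X = [[0.0], [1.0], [2.0]]
X_cf = [[0.0], [2.0], [4.0]]  # every pairwise distance doubles
score = robustness(X, X_cf, k=1)
```

With counterfactuals that stretch all distances by a factor of two, every neighbor ratio is 2, so the reported percentile is 2 as well.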
Efficiency
Mean per-instance generation time. Based on Li et al. (2023).
- class tscf_eval.Efficiency[source]
Bases: Metric
Mean per-instance timing summary.
This metric reports the average elapsed time per instance (in seconds) as provided by the caller via the time_per_instance iterable. The metric itself does not inspect X or X_cf but follows the Metric.compute signature for compatibility with Evaluator.
See Li et al. (2023) for details.
- compute(X, X_cf, time_per_instance=None, **kwargs)[source]
Compute mean elapsed time per instance.
- Parameters:
X (np.ndarray) – Original instances. Present for API compatibility but not used.
X_cf (np.ndarray) – Counterfactual instances. Present for API compatibility but not used.
time_per_instance (iterable of float, optional) – Iterable of elapsed times (seconds) for each produced counterfactual instance. Can be a list, generator, or any iterable. If omitted, the metric returns 0.0.
**kwargs – Additional keyword arguments (unused).
- Returns:
float – Mean elapsed time per instance in seconds, or 0.0 when no timings are provided.
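The timing summary amounts to a guarded mean. The helper mean_time is hypothetical; returning 0.0 for an empty iterable is an assumption that extends the documented behavior for an omitted argument.

```python
def mean_time(time_per_instance=None):
    """Mean elapsed seconds per instance, or 0.0 when no timings are given."""
    if time_per_instance is None:
        return 0.0
    # Materialize first so generators can be both counted and summed.
    times = [float(t) for t in time_per_instance]
    return sum(times) / len(times) if times else 0.0


timings = (t for t in [0.5, 1.0, 1.5])  # a generator works as well as a list
mean_s = mean_time(timings)
```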
References
The metrics in this module are based on the following works:
Validity
Li, P., Bahri, O., Boubrahimi, S. F., & Hamdi, S. M. (2023). “CELS: Counterfactual Explanations for Time Series Data via Learned Saliency Maps.” In 2023 IEEE International Conference on Big Data (BigData), pp. 718-727.
Proximity
Delaney, E., Greene, D., & Keane, M. T. (2021). “Instance-Based Counterfactual Explanations for Time Series Classification.” arXiv:2009.13211.
Bahri, O., Boubrahimi, S. F., & Hamdi, S. M. (2022). “Shapelet-Based Counterfactual Explanations for Multivariate Time Series.”
Sparsity
Mothilal, R. K., Sharma, A., & Tan, C. (2020). “Explaining Machine Learning Classifiers through Diverse Counterfactual Explanations.” In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (FAT* '20), pp. 607-617.
Plausibility
Breunig, M. M., Kriegel, H.-P., Ng, R. T., & Sander, J. (2000). “LOF: Identifying Density-Based Local Outliers.” In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pp. 93-104.
Liu, F. T., Ting, K. M., & Zhou, Z.-H. (2008). “Isolation Forest.” In 2008 Eighth IEEE International Conference on Data Mining, pp. 413-422.
Yeh, C.-C. M., Zhu, Y., Ulanova, L., Begum, N., Ding, Y., Dau, H. A., Silva, D. F., Mueen, A., & Keogh, E. (2016). “Matrix Profile I: All Pairs Similarity Joins for Time Series.” In 2016 IEEE 16th International Conference on Data Mining (ICDM), pp. 1317-1322.
Diversity
Mothilal, R. K., Sharma, A., & Tan, C. (2020). “Explaining Machine Learning Classifiers through Diverse Counterfactual Explanations.” In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (FAT* '20), pp. 607-617.
Kulesza, A., & Taskar, B. (2012). “Determinantal Point Processes for Machine Learning.” Foundations and Trends in Machine Learning, 5(2-3), 123-286.
Confidence
Le, T., Miller, T., Singh, R., & Sonenberg, L. (2023). “Explaining model confidence using counterfactuals.” In Proceedings of the AAAI Conference on Artificial Intelligence, 37(10), pp. 12101-12109.
Controllability
Verma, S., Boonsanong, V., Hoang, M., Hines, K. E., Dickerson, J. P., & Shah, C. (2024). “Counterfactual Explanations for Machine Learning: Challenges Revisited.” ACM Computing Surveys, 56(12), Article 304.
Composition, Contiguity
Delaney, E., Greene, D., & Keane, M. T. (2021). “Instance-Based Counterfactual Explanations for Time Series Classification.” In Case-Based Reasoning Research and Development (ICCBR 2021), pp. 32-47. Springer.
Ates, E., Aksar, B., Leung, V. J., & Coskun, A. K. (2021). “Counterfactual Explanations for Multivariate Time Series.” In Proceedings of the 2021 International Conference on Applied Artificial Intelligence (ICAPAI), pp. 1-8.
Robustness
Ates, E., Aksar, B., Leung, V. J., & Coskun, A. K. (2021). “Counterfactual Explanations for Multivariate Time Series.” In Proceedings of the 2021 International Conference on Applied Artificial Intelligence (ICAPAI), pp. 1-8.
Efficiency
Li, P., Bahri, O., Boubrahimi, S. F., & Hamdi, S. M. (2023). “Attention-Based Counterfactual Explanation for Multivariate Time Series.” In Data Warehousing and Knowledge Discovery (DaWaK 2023), Lecture Notes in Computer Science, vol 14148, pp. 287-293. Springer.