Examples
This guide provides comprehensive examples for common use cases with TSCFEval, from generating counterfactuals to running benchmarks and analyzing results.
Generating Counterfactuals
TSCFEval provides 7 built-in counterfactual methods covering different generation strategies: instance-based (NativeGuide, COMTE), evolutionary (TSEvo), gradient-based (Glacier, LatentCF), saliency-based (CELS), and shapelet-based (SETS).
All methods follow a unified interface:
Initialize with a fitted classifier and training data tuple
(X_train, y_train)Call
explain(x)to generate a counterfactual for instancexReturns a tuple
(cf, cf_label, meta)containing the counterfactual, its predicted label, and method-specific metadata
Using NativeGuide
NativeGuide is an instance-based method that generates counterfactuals by guiding the original instance toward its nearest unlike neighbor (NUN) - the closest training instance with a different predicted class. It supports four blending strategies:
blend: Linear interpolation toward NUN until prediction flipsng: Native Guide with weighted averagingdtw_dba: DTW Barycentric Averaging for time-series-aware blendingcam: Class Activation Map weighted guidance
from sklearn.neighbors import KNeighborsClassifier
from tscf_eval import UCRLoader, NativeGuide
# Load data and train classifier
loader = UCRLoader("ItalyPowerDemand")
train, test = loader.load("train"), loader.load("test")
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(train.X, train.y)
# Create explainer (methods: "blend", "ng", "dtw_dba", "cam")
explainer = NativeGuide(clf, (train.X, train.y), method="blend")
# Generate counterfactual for a single instance
x = test.X[0]
cf, cf_label, meta = explainer.explain(x)
print(f"Original prediction: {clf.predict(x.reshape(1, -1))[0]}")
print(f"Counterfactual prediction: {cf_label}")
Using COMTE
COMTE (Counterfactual Multivariate Time-series Explanations) generates counterfactuals by greedily substituting channels from a “distractor” series - a training instance from a different class. It iteratively replaces channels until the prediction flips, producing sparse, interpretable explanations that highlight which channels are most important for the classification decision. Works with both univariate and multivariate time series, using Euclidean or DTW distance for distractor selection:
from tscf_eval import UCRLoader, COMTE
explainer = COMTE(clf, (train.X, train.y), distance="dtw")
cf, cf_label, meta = explainer.explain(test.X[0])
Using TSEvo
TSEvo uses multi-objective evolutionary optimization (NSGA-II) to generate counterfactuals that balance validity, proximity, and plausibility. It applies mutation operators to evolve a population of candidate counterfactuals over multiple generations. Three transformer types control how mutations are applied:
authentic: Mutations based on authentic patterns from training datafrequency: Frequency-domain perturbationsgaussian: Random Gaussian noise perturbations
from tscf_eval import UCRLoader, TSEvo
# Transformers: "authentic", "frequency", "gaussian"
explainer = TSEvo(clf, (train.X, train.y), transformer="authentic")
cf, cf_label, meta = explainer.explain(test.X[0])
Using Glacier
Glacier (Guided Locally Constrained Counterfactual Explanations) uses gradient-based
optimization with importance-weighted proximity constraints. It optimizes in the input
space while penalizing changes to important time points more heavily. Requires a
differentiable classifier (e.g., neural networks). The weight_type parameter
controls how importance weights are computed:
uniform: Equal weight for all time pointslocal: Weights based on local gradients (instance-specific)global: Weights based on global feature importance
from tscf_eval import UCRLoader, Glacier
# Weight types: "uniform", "local", "global"
explainer = Glacier(clf, (train.X, train.y), weight_type="uniform")
cf, cf_label, meta = explainer.explain(test.X[0])
Using SETS and CELS
SETS and CELS use different strategies for identifying discriminative regions:
SETS (Shapelet-based Explanations for Time Series): Identifies class-discriminative shapelets and generates counterfactuals by manipulating these subsequences. Produces contiguous, localized perturbations that are often more interpretable.
CELS (Counterfactual Explanations via Learned Saliency): Uses learned saliency maps to identify important time points, then blends the original instance with its nearest unlike neighbor weighted by the saliency scores. Produces smooth counterfactuals that focus changes on the most discriminative regions.
from tscf_eval import UCRLoader, SETS, CELS
# SETS: Shapelet-based explanations
explainer_sets = SETS(clf, (train.X, train.y))
cf, cf_label, meta = explainer_sets.explain(test.X[0])
# CELS: Saliency map blending
explainer_cels = CELS(clf, (train.X, train.y))
cf, cf_label, meta = explainer_cels.explain(test.X[0])
Evaluating Counterfactuals
TSCFEval provides 11 metrics across 6 quality dimensions for comprehensive counterfactual evaluation:
Core Quality: Validity, Proximity, Sparsity
Distribution Alignment: Plausibility, Diversity
Structural Properties: Contiguity, Composition
Model Behavior: Confidence, Controllability
Stability: Robustness
Performance: Efficiency
The Evaluator class provides a flexible interface for computing any combination
of these metrics. Each metric has specific requirements (e.g., some need the model,
others need training data) which are detailed in the API reference.
Basic Evaluation
The core metrics (Validity, Proximity, Sparsity) measure fundamental counterfactual quality. Validity checks if the prediction changed, Proximity measures how close the counterfactual is to the original, and Sparsity quantifies the fraction of changed features:
from sklearn.neighbors import KNeighborsClassifier
from tscf_eval import UCRLoader, NativeGuide
from tscf_eval.evaluator import Evaluator, Validity, Proximity, Sparsity
# Load data and train classifier
loader = UCRLoader("ItalyPowerDemand")
train, test = loader.load("train"), loader.load("test")
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(train.X, train.y)
# Generate counterfactuals
explainer = NativeGuide(clf, (train.X, train.y), method="blend")
X, X_cf, y, y_cf = [], [], [], []
for x in test.X[:10]:
cf, cf_label, _ = explainer.explain(x)
X.append(x)
X_cf.append(cf)
y.append(clf.predict(x.reshape(1, -1))[0])
y_cf.append(cf_label)
# Create evaluator
evaluator = Evaluator([
Validity(),
Proximity(p=2, distance="lp"),
Proximity(distance="dtw"),
Sparsity(),
])
# Evaluate
results = evaluator.evaluate(X, X_cf, y=y, y_cf=y_cf)
print(f"Validity: {results['validity_soft']:.2%}")
print(f"Proximity (L2): {results['proximity_l2']:.4f}")
print(f"Proximity (DTW): {results['proximity_dtw']:.4f}")
print(f"Sparsity: {results['sparsity']:.2%}")
Using Model-Dependent Metrics
Some metrics require access to the classifier to compute their values:
Validity: When labels aren’t provided, predictions are inferred from the model
Controllability: Measures how easily the counterfactual changes can be reverted by modifying a single feature (requires making predictions on modified instances)
Confidence: Reports the model’s predicted probabilities for both the original instance and the counterfactual (requires
predict_proba)
from tscf_eval.evaluator import (
Evaluator, Validity, Proximity, Sparsity,
Controllability, Confidence
)
evaluator = Evaluator([
Validity(),
Proximity(distance="dtw"),
Sparsity(),
Controllability(),
Confidence(),
])
# Pass the model to evaluate()
results = evaluator.evaluate(X, X_cf, model=clf, X_train=train.X)
print(f"Validity: {results['validity_soft']:.2%}")
print(f"Controllability: {results['controllability']:.4f}")
print(f"Mean CF confidence: {results['mean_conf_cf']:.4f}")
Using Distribution Metrics
Distribution metrics assess whether counterfactuals are realistic and diverse:
Plausibility: Measures whether counterfactuals lie within the training data distribution using outlier detection. High plausibility means the counterfactual resembles real training instances. Methods include LOF (Local Outlier Factor), Isolation Forest, and DTW-based LOF for time-series-aware detection.
Diversity: When generating multiple counterfactuals per instance, measures the variety among them using Determinantal Point Processes (DPP). Higher diversity means the counterfactuals explore different regions of the feature space.
Both metrics require X_train to be passed to evaluate():
from tscf_eval.evaluator import (
Evaluator, Plausibility, Diversity, Contiguity
)
evaluator = Evaluator([
Plausibility(method="lof"), # Local Outlier Factor
Plausibility(method="dtw_lof"), # DTW-based LOF
Diversity(distance="euclidean"),
Diversity(distance="dtw"),
Contiguity(),
])
# Pass X_train for distribution metrics
results = evaluator.evaluate(X, X_cf, y=y, y_cf=y_cf, X_train=train.X)
Measuring Efficiency
The Efficiency metric tracks how long it takes to generate each counterfactual. This is important for comparing methods in practical applications where generation time matters. You must measure the time yourself and pass it to the evaluator:
import time
from tscf_eval import TSEvo
from tscf_eval.evaluator import Evaluator, Validity, Proximity, Efficiency
explainer = TSEvo(clf, (train.X, train.y), transformer="authentic")
X, X_cf, times = [], [], []
for x in test.X[:5]:
start = time.perf_counter()
cf, _, _ = explainer.explain(x)
times.append(time.perf_counter() - start)
X.append(x)
X_cf.append(cf)
evaluator = Evaluator([Validity(), Proximity(distance="dtw"), Efficiency()])
results = evaluator.evaluate(X, X_cf, model=clf, time_per_instance=times)
print(f"Mean time: {results['efficiency_time_s']:.4f}s")
Full Evaluation with All Metrics
For comprehensive evaluation, you can use all available metrics together. Note that
this requires providing all optional parameters (model, X_train, y, y_cf,
time_per_instance) to satisfy each metric’s requirements:
import time
from tscf_eval import UCRLoader, Glacier
from tscf_eval.evaluator import (
Evaluator, Validity, Proximity, Sparsity,
Plausibility, Diversity, Controllability, Confidence,
Composition, Contiguity, Robustness, Efficiency
)
evaluator = Evaluator([
# Core
Validity(),
Proximity(p=2, distance="lp"),
Proximity(distance="dtw"),
Sparsity(),
# Distribution
Plausibility(method="lof"),
Plausibility(method="dtw_lof"),
Diversity(distance="dtw"),
# Model behavior
Controllability(),
Confidence(),
# Structure
Composition(),
Contiguity(),
# Stability and performance
Robustness(distance="dtw"),
Efficiency(),
])
results = evaluator.evaluate(
X, X_cf,
model=clf,
X_train=train.X,
y=y,
y_cf=y_cf,
time_per_instance=times,
)
Running Benchmarks
The BenchmarkRunner class provides a structured framework for systematically
comparing counterfactual methods. It handles:
Instance selection: Random or confidence-stratified sampling of test instances
Parallel execution: Run multiple explainers in parallel with
n_jobsProgress tracking: Built-in progress bars with tqdm
Result aggregation: Aggregate results by explainer, dataset, or model
TSCFEval supports three benchmarking scenarios:
Single dataset, multiple CF methods: Compare explainer algorithms on a fixed dataset
Single dataset, multiple classifiers: Study how the classifier affects CF quality
Multiple datasets, fixed classifier: Assess generalization across datasets
Single-Dataset Benchmark
The most common scenario: compare multiple counterfactual methods on a single dataset
with a fixed classifier. Use instance_selection="stratified_confidence" to ensure
coverage of both high-confidence and uncertain instances near the decision boundary:
from sklearn.neighbors import KNeighborsClassifier
from tscf_eval import Evaluator, Validity, Proximity, Sparsity
from tscf_eval.benchmark import (
BenchmarkRunner, DatasetConfig, ModelConfig, ExplainerConfig,
)
from tscf_eval.counterfactuals import COMTE, NativeGuide, Glacier
from tscf_eval.data_loader import UCRLoader
# Load data
loader = UCRLoader("ItalyPowerDemand")
train, test = loader.load("train"), loader.load("test")
# Train classifier
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(train.X, train.y)
# Configure explainers
explainer_configs = [
ExplainerConfig("comte", COMTE, {"distance": "dtw"}),
ExplainerConfig("ng_blend", NativeGuide, {"method": "blend"}),
ExplainerConfig("glacier", Glacier, {"weight_type": "uniform"}),
]
# Configure evaluator
evaluator = Evaluator([
Validity(),
Proximity(distance="dtw"),
Sparsity(),
])
# Run benchmark
runner = BenchmarkRunner(
datasets=[DatasetConfig("ItalyPowerDemand", train.X, train.y, test.X, test.y)],
models=[ModelConfig("knn", clf)],
explainers=explainer_configs,
evaluator=evaluator,
n_instances=20,
instance_selection="stratified_confidence",
verbose=True,
)
results = runner.run()
# View results
print(results.to_dataframe())
print(results.aggregate(by="explainer"))
Multi-Dataset Benchmark
To assess how well counterfactual methods generalize, run benchmarks across multiple datasets. This enables statistical testing (e.g., Friedman test) to determine if performance differences are significant across problem domains:
from tscf_eval.benchmark import (
BenchmarkRunner, DatasetConfig, ModelConfig, ExplainerConfig,
)
from tscf_eval.counterfactuals import COMTE, NativeGuide
from tscf_eval.data_loader import UCRLoader
# Load datasets and train models
dataset_names = ["ItalyPowerDemand", "GunPoint", "ECG200"]
datasets, model_configs = [], []
for name in dataset_names:
loader = UCRLoader(name)
train, test = loader.load("train"), loader.load("test")
datasets.append(DatasetConfig(name, train.X, train.y, test.X, test.y))
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(train.X, train.y)
model_configs.append(ModelConfig("knn", clf))
# Run benchmark
runner = BenchmarkRunner(
datasets=datasets,
models=model_configs,
explainers=[
ExplainerConfig("comte", COMTE, {"distance": "dtw"}),
ExplainerConfig("ng_blend", NativeGuide, {"method": "blend"}),
],
n_instances=10,
n_jobs=-1, # Parallel execution
verbose=True,
)
results = runner.run()
# Aggregate across datasets
print(results.aggregate(by="explainer"))
Analyzing Results
Counterfactual evaluation is inherently multi-objective: high validity may come at the cost of low proximity, and sparse explanations may sacrifice plausibility. TSCFEval provides tools for principled multi-criteria analysis.
Pareto Analysis
Pareto analysis identifies methods that are not dominated by any other method on the selected metrics. A method is Pareto-optimal if no other method is better on all metrics simultaneously. This avoids the need to specify metric weights upfront:
from tscf_eval.benchmark import ParetoAnalyzer
analyzer = ParetoAnalyzer(metrics=[
"validity_soft", "proximity_dtw", "sparsity",
])
# Find non-dominated methods
pareto_methods = analyzer.pareto_front(results)
print(f"Pareto-optimal: {pareto_methods}")
# Full ranking table
print(analyzer.dominance_ranking(results))
# Export to LaTeX
latex = analyzer.to_latex(results, caption="Results", label="tab:results")
Visualizing Pareto Fronts
Pareto front visualizations help understand the trade-offs between metrics. The 2D plot shows which methods lie on the Pareto front (non-dominated solutions) for any pair of metrics. Consistency heatmaps show how often each method appears on the Pareto front across different datasets:
import matplotlib.pyplot as plt
# 2D Pareto front plot
ax = analyzer.plot_front(
results,
x_metric="proximity_dtw",
y_metric="validity_soft",
annotate=True,
)
plt.savefig("pareto_front.png")
# Cross-dataset consistency heatmap
results_by_dataset = {
ds: results.filter(datasets=[ds])
for ds in results.datasets
}
consistency_df = analyzer.consistency(results_by_dataset)
analyzer.plot_consistency_heatmap(consistency_df)
plt.savefig("consistency.png")
Weighted Scalarization
When you need a single ranking of methods, weighted scalarization combines metrics into a composite score. Each metric is min-max normalized to [0, 1] with direction awareness (maximize metrics are higher-is-better, minimize metrics are inverted), then combined via weighted sum. This enables customizable rankings based on your priorities:
from tscf_eval.benchmark import WeightedScalarizer
# Equal weights
scalarizer = WeightedScalarizer(metrics=[
"validity_soft", "proximity_dtw", "sparsity",
])
print(scalarizer.score(results))
# Custom weights
scalarizer = WeightedScalarizer(
metrics=["validity_soft", "proximity_dtw", "sparsity"],
weights={"validity_soft": 3.0, "proximity_dtw": 1.0, "sparsity": 1.0},
)
# Sensitivity analysis
sens_df = scalarizer.sensitivity(results, vary_metric="validity_soft", n_steps=11)
scalarizer.plot_sensitivity(sens_df)
Statistical Testing
When benchmarking across multiple datasets, the Friedman test determines if there are statistically significant differences between methods. It’s a non-parametric alternative to repeated-measures ANOVA, ranking methods within each dataset and testing if the average ranks differ significantly:
from tscf_eval.benchmark import friedman_test
fr = friedman_test(results, metric="validity_soft")
print(f"Statistic: {fr.statistic:.3f}, p-value: {fr.p_value:.4f}")
print(fr.rankings)
Extending TSCFEval
TSCFEval is designed to be extensible. You can add your own counterfactual methods and evaluation metrics that integrate seamlessly with the benchmarking framework.
Custom Counterfactual Method
To add a new counterfactual method, inherit from the Counterfactual base class
and implement the explain method. The method receives a single instance x
and returns a tuple (cf, cf_label, meta):
cf: The generated counterfactual (same shape as input)cf_label: The predicted class label for the counterfactualmeta: A dictionary with method-specific metadata (e.g., generation parameters)
Here’s an example of a simple interpolation-based method:
import numpy as np
from tscf_eval.counterfactuals import Counterfactual
class MyCounterfactual(Counterfactual):
"""Custom counterfactual using nearest unlike neighbor interpolation."""
def __init__(self, model, data, n_steps=50):
self.model = model
self.X_train, self.y_train = data
self.n_steps = n_steps
def explain(self, x, y_pred=None):
x = np.asarray(x).squeeze()
if y_pred is None:
y_pred = int(self.model.predict(x.reshape(1, -1))[0])
# Find nearest unlike neighbor
preds = self.model.predict(self.X_train)
unlike_mask = preds != y_pred
unlike_samples = self.X_train[unlike_mask]
distances = np.linalg.norm(
unlike_samples.reshape(len(unlike_samples), -1) - x.flatten(),
axis=1
)
target = unlike_samples[np.argmin(distances)]
# Interpolate toward target until prediction flips
cf = x.copy()
for i in range(self.n_steps):
alpha = (i + 1) / self.n_steps
cf = (1 - alpha) * x + alpha * target.squeeze()
cf_label = int(self.model.predict(cf.reshape(1, -1))[0])
if cf_label != y_pred:
break
meta = {"method": "my_cf", "steps": i + 1, "alpha": alpha}
return cf, cf_label, meta
# Use in benchmarks
from tscf_eval.benchmark import ExplainerConfig
config = ExplainerConfig("my_method", MyCounterfactual, {"n_steps": 50})
Custom Evaluation Metric
To add a new evaluation metric, inherit from the Metric base class and implement:
name(): Returns the metric key used in results dictionariescompute(X, X_cf, **kwargs): Computes and returns the metric value
The compute method receives the original instances X, counterfactuals X_cf,
and any additional keyword arguments passed to evaluate() (e.g., model,
X_train, y, y_cf). Here’s an example metric that measures the maximum
per-instance change:
import numpy as np
from tscf_eval.evaluator import Metric
class MaxChangeMetric(Metric):
"""Fraction of instances where max change exceeds threshold."""
def __init__(self, threshold=0.1):
self.threshold = threshold
def name(self):
return f"max_change_t{self.threshold}"
def compute(self, X, X_cf, **kwargs):
diff = np.abs(np.array(X) - np.array(X_cf))
max_changes = np.max(diff.reshape(len(X), -1), axis=1)
return float(np.mean(max_changes > self.threshold))
# Use in evaluator
from tscf_eval.evaluator import Evaluator, Validity
evaluator = Evaluator([
Validity(),
MaxChangeMetric(threshold=0.1),
MaxChangeMetric(threshold=0.5),
])
Complete Workflow
This end-to-end example demonstrates a typical TSCFEval workflow: loading data, training a classifier, running a benchmark, and analyzing results with multiple analysis tools. The results are saved to JSON for later analysis or visualization:
import json
from sklearn.neighbors import KNeighborsClassifier
from tscf_eval import UCRLoader
from tscf_eval.counterfactuals import COMTE, NativeGuide
from tscf_eval.evaluator import Evaluator, Validity, Proximity, Sparsity
from tscf_eval.benchmark import (
BenchmarkRunner, DatasetConfig, ModelConfig, ExplainerConfig,
ParetoAnalyzer, WeightedScalarizer, friedman_test,
)
# 1. Load data
loader = UCRLoader("ItalyPowerDemand")
train, test = loader.load("train"), loader.load("test")
# 2. Train classifier
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(train.X, train.y)
# 3. Run benchmark
runner = BenchmarkRunner(
datasets=[DatasetConfig("ItalyPowerDemand", train.X, train.y, test.X, test.y)],
models=[ModelConfig("knn", clf)],
explainers=[
ExplainerConfig("comte", COMTE, {"distance": "dtw"}),
ExplainerConfig("ng_blend", NativeGuide, {"method": "blend"}),
],
evaluator=Evaluator([Validity(), Proximity(distance="dtw"), Sparsity()]),
n_instances=20,
instance_selection="stratified_confidence",
verbose=True,
)
results = runner.run()
# 4. View results
print(results.to_dataframe())
print(results.aggregate(by="explainer"))
# 5. Pareto analysis
analyzer = ParetoAnalyzer(metrics=["validity_soft", "proximity_dtw", "sparsity"])
print(f"Pareto-optimal: {analyzer.pareto_front(results)}")
print(analyzer.dominance_ranking(results))
# 6. Weighted ranking
scalarizer = WeightedScalarizer(metrics=["validity_soft", "proximity_dtw", "sparsity"])
print(scalarizer.score(results))
# 7. Save results
with open("results.json", "w") as f:
json.dump(results.to_dict(), f, indent=2)