Examples

This guide provides comprehensive examples for common use cases with TSCFEval, from generating counterfactuals to running benchmarks and analyzing results.

Generating Counterfactuals

TSCFEval provides 7 built-in counterfactual methods covering different generation strategies: instance-based (NativeGuide, COMTE), evolutionary (TSEvo), gradient-based (Glacier, LatentCF), saliency-based (CELS), and shapelet-based (SETS).

All methods follow a unified interface:

  1. Initialize with a fitted classifier and training data tuple (X_train, y_train)

  2. Call explain(x) to generate a counterfactual for instance x

  3. Receive a tuple (cf, cf_label, meta) containing the counterfactual, its predicted label, and method-specific metadata

Using NativeGuide

NativeGuide is an instance-based method that generates counterfactuals by guiding the original instance toward its nearest unlike neighbor (NUN), the closest training instance with a different predicted class. It supports four blending strategies:

  • blend: Linear interpolation toward NUN until prediction flips

  • ng: Native Guide with weighted averaging

  • dtw_dba: DTW Barycentric Averaging for time-series-aware blending

  • cam: Class Activation Map weighted guidance

from sklearn.neighbors import KNeighborsClassifier
from tscf_eval import UCRLoader, NativeGuide

# Load data and train classifier
loader = UCRLoader("ItalyPowerDemand")
train, test = loader.load("train"), loader.load("test")
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(train.X, train.y)

# Create explainer (methods: "blend", "ng", "dtw_dba", "cam")
explainer = NativeGuide(clf, (train.X, train.y), method="blend")

# Generate counterfactual for a single instance
x = test.X[0]
cf, cf_label, meta = explainer.explain(x)

print(f"Original prediction: {clf.predict(x.reshape(1, -1))[0]}")
print(f"Counterfactual prediction: {cf_label}")

Using COMTE

COMTE (Counterfactual Multivariate Time-series Explanations) generates counterfactuals by greedily substituting channels from a “distractor” series, a training instance from a different class. It replaces channels one at a time until the prediction flips, producing sparse, interpretable explanations that highlight which channels matter most for the classification decision. It works with both univariate and multivariate time series and selects the distractor using either Euclidean or DTW distance:

from tscf_eval import UCRLoader, COMTE

explainer = COMTE(clf, (train.X, train.y), distance="dtw")
cf, cf_label, meta = explainer.explain(test.X[0])
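
The greedy substitution idea can be sketched in plain NumPy. This is a minimal illustration, not TSCFEval's COMTE implementation: greedy_channel_swap, prob_original, and the toy linear model are hypothetical names introduced here, and series are assumed to have shape (n_channels, n_timesteps).

```python
import numpy as np

def greedy_channel_swap(x, distractor, prob_original, threshold=0.5):
    """Copy channels from `distractor` into `x` until the original class loses.

    Hypothetical helper, not part of TSCFEval. `prob_original` maps one
    (n_channels, n_timesteps) array to the probability of the original class.
    """
    cf = x.copy()
    swapped = []
    remaining = list(range(x.shape[0]))
    while remaining and prob_original(cf) >= threshold:
        # Try each remaining channel and keep the swap that lowers the
        # original-class probability the most.
        best_channel, best_prob = None, np.inf
        for c in remaining:
            trial = cf.copy()
            trial[c] = distractor[c]
            p = prob_original(trial)
            if p < best_prob:
                best_channel, best_prob = c, p
        cf[best_channel] = distractor[best_channel]
        swapped.append(best_channel)
        remaining.remove(best_channel)
    return cf, swapped

# Toy stand-in for a classifier: a logistic model on per-channel means.
weights = np.array([3.0, 1.0, 0.5])

def prob_original(series):
    return float(1.0 / (1.0 + np.exp(-weights @ series.mean(axis=1))))

x = np.ones((3, 4))            # instance to explain
distractor = -np.ones((3, 4))  # instance from the other class
cf, swapped = greedy_channel_swap(x, distractor, prob_original)
print("Swapped channels:", swapped)
```

The list of swapped channels is what makes this style of explanation sparse: only the channels needed to flip the prediction are touched.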

Using TSEvo

TSEvo uses multi-objective evolutionary optimization (NSGA-II) to generate counterfactuals that balance validity, proximity, and plausibility. It applies mutation operators to evolve a population of candidate counterfactuals over multiple generations. Three transformer types control how mutations are applied:

  • authentic: Mutations based on authentic patterns from training data

  • frequency: Frequency-domain perturbations

  • gaussian: Random Gaussian noise perturbations

from tscf_eval import UCRLoader, TSEvo

# Transformers: "authentic", "frequency", "gaussian"
explainer = TSEvo(clf, (train.X, train.y), transformer="authentic")
cf, cf_label, meta = explainer.explain(test.X[0])

Using Glacier

Glacier (Guided Locally Constrained Counterfactual Explanations) uses gradient-based optimization with importance-weighted proximity constraints. It optimizes in the input space while penalizing changes to important time points more heavily. It requires a differentiable classifier (e.g., a neural network). The weight_type parameter controls how importance weights are computed:

  • uniform: Equal weight for all time points

  • local: Weights based on local gradients (instance-specific)

  • global: Weights based on global feature importance

from tscf_eval import UCRLoader, Glacier

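# Weight types: "uniform", "local", "global"
# Note: clf must be a differentiable model (e.g., a neural network) here;
# a k-NN classifier does not meet Glacier's requirement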
explainer = Glacier(clf, (train.X, train.y), weight_type="uniform")
cf, cf_label, meta = explainer.explain(test.X[0])

Using SETS and CELS

SETS and CELS use different strategies for identifying discriminative regions:

  • SETS (Shapelet-based Explanations for Time Series): Identifies class-discriminative shapelets and generates counterfactuals by manipulating these subsequences. Produces contiguous, localized perturbations that are often more interpretable.

  • CELS (Counterfactual Explanations via Learned Saliency): Uses learned saliency maps to identify important time points, then blends the original instance with its nearest unlike neighbor weighted by the saliency scores. Produces smooth counterfactuals that focus changes on the most discriminative regions.

from tscf_eval import UCRLoader, SETS, CELS

# SETS: Shapelet-based explanations
explainer_sets = SETS(clf, (train.X, train.y))
cf, cf_label, meta = explainer_sets.explain(test.X[0])

# CELS: Saliency map blending
explainer_cels = CELS(clf, (train.X, train.y))
cf, cf_label, meta = explainer_cels.explain(test.X[0])
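
The saliency-weighted blending step described for CELS can be illustrated as follows. This is a simplified sketch under stated assumptions, not the library's implementation: saliency_blend is a hypothetical helper, the saliency map is given by hand rather than learned, and its values are assumed to lie in [0, 1].

```python
import numpy as np

def saliency_blend(x, nun, saliency):
    """Blend x toward its nearest unlike neighbor, weighted per time point.

    Hypothetical helper. A saliency of 0 keeps the original value and a
    saliency of 1 copies the value from the nearest unlike neighbor.
    """
    saliency = np.clip(np.asarray(saliency, dtype=float), 0.0, 1.0)
    return (1.0 - saliency) * x + saliency * nun

x = np.zeros(6)                  # original instance
nun = np.full(6, 2.0)            # nearest unlike neighbor
saliency = np.array([0.0, 0.0, 0.5, 1.0, 0.5, 0.0])  # hand-written map
cf = saliency_blend(x, nun, saliency)
print(cf)
```

Time points with zero saliency are left untouched, so the changes concentrate on the region the saliency map marks as discriminative.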

Evaluating Counterfactuals

TSCFEval provides 11 metrics across 6 quality dimensions for comprehensive counterfactual evaluation:

  1. Core Quality: Validity, Proximity, Sparsity

  2. Distribution Alignment: Plausibility, Diversity

  3. Structural Properties: Contiguity, Composition

  4. Model Behavior: Confidence, Controllability

  5. Stability: Robustness

  6. Performance: Efficiency

The Evaluator class provides a flexible interface for computing any combination of these metrics. Each metric has specific requirements (e.g., some need the model, others need training data) which are detailed in the API reference.

Basic Evaluation

The core metrics (Validity, Proximity, Sparsity) measure fundamental counterfactual quality. Validity checks if the prediction changed, Proximity measures how close the counterfactual is to the original, and Sparsity quantifies the fraction of changed features:

from sklearn.neighbors import KNeighborsClassifier
from tscf_eval import UCRLoader, NativeGuide
from tscf_eval.evaluator import Evaluator, Validity, Proximity, Sparsity

# Load data and train classifier
loader = UCRLoader("ItalyPowerDemand")
train, test = loader.load("train"), loader.load("test")
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(train.X, train.y)

# Generate counterfactuals
explainer = NativeGuide(clf, (train.X, train.y), method="blend")
X, X_cf, y, y_cf = [], [], [], []
for x in test.X[:10]:
    cf, cf_label, _ = explainer.explain(x)
    X.append(x)
    X_cf.append(cf)
    y.append(clf.predict(x.reshape(1, -1))[0])
    y_cf.append(cf_label)

# Create evaluator
evaluator = Evaluator([
    Validity(),
    Proximity(p=2, distance="lp"),
    Proximity(distance="dtw"),
    Sparsity(),
])

# Evaluate
results = evaluator.evaluate(X, X_cf, y=y, y_cf=y_cf)

print(f"Validity: {results['validity_soft']:.2%}")
print(f"Proximity (L2): {results['proximity_l2']:.4f}")
print(f"Proximity (DTW): {results['proximity_dtw']:.4f}")
print(f"Sparsity: {results['sparsity']:.2%}")
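
To make the three definitions concrete, the sketch below computes them by hand with NumPy. It is only an approximation of what the metrics measure: core_metrics and its result keys are hypothetical, the changed fraction follows the description above (share of values that differ), and TSCFEval's exact definitions and normalizations may differ.

```python
import numpy as np

def core_metrics(X, X_cf, y, y_cf, tol=1e-8):
    """Hand-rolled validity, L2 proximity, and changed fraction (hypothetical)."""
    X = np.asarray(X, dtype=float).reshape(len(X), -1)
    X_cf = np.asarray(X_cf, dtype=float).reshape(len(X_cf), -1)
    # Validity: share of instances whose predicted label changed
    validity = float(np.mean(np.asarray(y) != np.asarray(y_cf)))
    # Proximity: mean Euclidean distance between originals and counterfactuals
    proximity_l2 = float(np.mean(np.linalg.norm(X - X_cf, axis=1)))
    # Changed fraction: share of values that differ by more than tol
    changed_fraction = float(np.mean(np.abs(X - X_cf) > tol))
    return {
        "validity": validity,
        "proximity_l2": proximity_l2,
        "changed_fraction": changed_fraction,
    }

X_toy = np.array([[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]])
X_cf_toy = np.array([[3.0, 4.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]])
toy = core_metrics(X_toy, X_cf_toy, y=[0, 1], y_cf=[1, 1])
print(toy)
```

In this toy case only the first counterfactual flips the label, which is why validity and the other two values describe different aspects of quality and are reported separately.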

Using Model-Dependent Metrics

Some metrics require access to the classifier to compute their values:

  • Validity: When labels aren’t provided, predictions are inferred from the model

  • Controllability: Measures how easily the counterfactual changes can be reverted by modifying a single feature (requires making predictions on modified instances)

  • Confidence: Reports the model’s predicted probabilities for both the original instance and the counterfactual (requires predict_proba)

from tscf_eval.evaluator import (
    Evaluator, Validity, Proximity, Sparsity,
    Controllability, Confidence
)

evaluator = Evaluator([
    Validity(),
    Proximity(distance="dtw"),
    Sparsity(),
    Controllability(),
    Confidence(),
])

# Pass the model to evaluate()
results = evaluator.evaluate(X, X_cf, model=clf, X_train=train.X)

print(f"Validity: {results['validity_soft']:.2%}")
print(f"Controllability: {results['controllability']:.4f}")
print(f"Mean CF confidence: {results['mean_conf_cf']:.4f}")

Using Distribution Metrics

Distribution metrics assess whether counterfactuals are realistic and diverse:

  • Plausibility: Measures whether counterfactuals lie within the training data distribution using outlier detection. High plausibility means the counterfactual resembles real training instances. Methods include LOF (Local Outlier Factor), Isolation Forest, and DTW-based LOF for time-series-aware detection.

  • Diversity: When generating multiple counterfactuals per instance, measures the variety among them using Determinantal Point Processes (DPP). Higher diversity means the counterfactuals explore different regions of the feature space.

Both metrics require X_train to be passed to evaluate():

from tscf_eval.evaluator import (
    Evaluator, Plausibility, Diversity, Contiguity
)

evaluator = Evaluator([
    Plausibility(method="lof"),       # Local Outlier Factor
    Plausibility(method="dtw_lof"),   # DTW-based LOF
    Diversity(distance="euclidean"),
    Diversity(distance="dtw"),
    Contiguity(),
])

# Pass X_train for distribution metrics
results = evaluator.evaluate(X, X_cf, y=y, y_cf=y_cf, X_train=train.X)
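
The idea behind LOF-based plausibility can be illustrated directly with scikit-learn's LocalOutlierFactor in novelty mode. This sketch is an assumption about the general approach, not TSCFEval's implementation; the toy arrays stand in for flattened training series and counterfactuals, and the inlier share is one possible way to summarize the result.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X_train_toy = rng.normal(size=(200, 8))   # stand-in for flattened training series
X_cf_toy = np.vstack([
    np.zeros(8),         # lies in the middle of the training distribution
    np.full(8, 10.0),    # lies far outside it
])

# novelty=True fits on the training data and scores unseen points afterwards
lof = LocalOutlierFactor(n_neighbors=20, novelty=True)
lof.fit(X_train_toy)
inlier_flags = lof.predict(X_cf_toy)      # +1 for inliers, -1 for outliers

# Share of counterfactuals judged to lie within the training distribution
plausibility = float(np.mean(inlier_flags == 1))
print(inlier_flags, plausibility)
```

A DTW-based variant would follow the same pattern with a precomputed DTW distance matrix instead of Euclidean distances.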

Measuring Efficiency

The Efficiency metric tracks how long it takes to generate each counterfactual. This is important for comparing methods in practical applications where generation time matters. You must measure the time yourself and pass it to the evaluator:

import time
from tscf_eval import TSEvo
from tscf_eval.evaluator import Evaluator, Validity, Proximity, Efficiency

explainer = TSEvo(clf, (train.X, train.y), transformer="authentic")
X, X_cf, times = [], [], []
for x in test.X[:5]:
    start = time.perf_counter()
    cf, _, _ = explainer.explain(x)
    times.append(time.perf_counter() - start)
    X.append(x)
    X_cf.append(cf)

evaluator = Evaluator([Validity(), Proximity(distance="dtw"), Efficiency()])
results = evaluator.evaluate(X, X_cf, model=clf, time_per_instance=times)

print(f"Mean time: {results['efficiency_time_s']:.4f}s")

Full Evaluation with All Metrics

For comprehensive evaluation, you can use all available metrics together. Note that this requires providing all optional parameters (model, X_train, y, y_cf, time_per_instance) to satisfy each metric’s requirements:

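import time
from tscf_eval.evaluator import (
    Evaluator, Validity, Proximity, Sparsity,
    Plausibility, Diversity, Controllability, Confidence,
    Composition, Contiguity, Robustness, Efficiency
)

# Regenerate all inputs from the same instances so that X, X_cf, y, y_cf,
# and times have matching lengths (reuses clf, test, and an explainer
# from the sections above)
X, X_cf, y, y_cf, times = [], [], [], [], []
for x in test.X[:10]:
    start = time.perf_counter()
    cf, cf_label, _ = explainer.explain(x)
    times.append(time.perf_counter() - start)
    X.append(x)
    X_cf.append(cf)
    y.append(clf.predict(x.reshape(1, -1))[0])
    y_cf.append(cf_label)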

evaluator = Evaluator([
    # Core
    Validity(),
    Proximity(p=2, distance="lp"),
    Proximity(distance="dtw"),
    Sparsity(),
    # Distribution
    Plausibility(method="lof"),
    Plausibility(method="dtw_lof"),
    Diversity(distance="dtw"),
    # Model behavior
    Controllability(),
    Confidence(),
    # Structure
    Composition(),
    Contiguity(),
    # Stability and performance
    Robustness(distance="dtw"),
    Efficiency(),
])

results = evaluator.evaluate(
    X, X_cf,
    model=clf,
    X_train=train.X,
    y=y,
    y_cf=y_cf,
    time_per_instance=times,
)

Running Benchmarks

The BenchmarkRunner class provides a structured framework for systematically comparing counterfactual methods. It handles:

  • Instance selection: Random or confidence-stratified sampling of test instances

  • Parallel execution: Run multiple explainers in parallel with n_jobs

  • Progress tracking: Built-in progress bars with tqdm

  • Result aggregation: Aggregate results by explainer, dataset, or model

TSCFEval supports three benchmarking scenarios:

  1. Single dataset, multiple CF methods: Compare explainer algorithms on a fixed dataset

  2. Single dataset, multiple classifiers: Study how the classifier affects CF quality

  3. Multiple datasets, fixed classifier: Assess generalization across datasets

Single-Dataset Benchmark

The most common scenario: compare multiple counterfactual methods on a single dataset with a fixed classifier. Use instance_selection="stratified_confidence" to ensure coverage of both high-confidence and uncertain instances near the decision boundary:

from sklearn.neighbors import KNeighborsClassifier
from tscf_eval.evaluator import Evaluator, Validity, Proximity, Sparsity
from tscf_eval.benchmark import (
    BenchmarkRunner, DatasetConfig, ModelConfig, ExplainerConfig,
)
from tscf_eval.counterfactuals import COMTE, NativeGuide, Glacier
from tscf_eval.data_loader import UCRLoader

# Load data
loader = UCRLoader("ItalyPowerDemand")
train, test = loader.load("train"), loader.load("test")

# Train classifier
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(train.X, train.y)

# Configure explainers
explainer_configs = [
    ExplainerConfig("comte", COMTE, {"distance": "dtw"}),
    ExplainerConfig("ng_blend", NativeGuide, {"method": "blend"}),
    # Glacier needs a differentiable model; drop this entry for k-NN
    ExplainerConfig("glacier", Glacier, {"weight_type": "uniform"}),
]

# Configure evaluator
evaluator = Evaluator([
    Validity(),
    Proximity(distance="dtw"),
    Sparsity(),
])

# Run benchmark
runner = BenchmarkRunner(
    datasets=[DatasetConfig("ItalyPowerDemand", train.X, train.y, test.X, test.y)],
    models=[ModelConfig("knn", clf)],
    explainers=explainer_configs,
    evaluator=evaluator,
    n_instances=20,
    instance_selection="stratified_confidence",
    verbose=True,
)
results = runner.run()

# View results
print(results.to_dataframe())
print(results.aggregate(by="explainer"))
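
Confidence-stratified selection can be pictured as binning test instances by the model's top-class probability and sampling from every bin. The sketch below is one plausible way to do this, not the selection logic BenchmarkRunner actually uses: stratified_confidence_indices, n_bins, and the quantile binning are all assumptions made for illustration.

```python
import numpy as np

def stratified_confidence_indices(proba, n_instances, n_bins=4, seed=0):
    """Pick instance indices spread across confidence levels (hypothetical)."""
    rng = np.random.default_rng(seed)
    confidence = np.max(proba, axis=1)  # top-class probability per instance
    # Quantile edges give bins holding roughly equal numbers of instances
    edges = np.quantile(confidence, np.linspace(0.0, 1.0, n_bins + 1))
    bin_ids = np.searchsorted(edges, confidence, side="right") - 1
    bin_ids = np.clip(bin_ids, 0, n_bins - 1)
    per_bin = n_instances // n_bins
    chosen = []
    for b in range(n_bins):
        members = np.flatnonzero(bin_ids == b)
        take = min(per_bin, len(members))
        chosen.extend(rng.choice(members, size=take, replace=False).tolist())
    return sorted(chosen)

# Toy binary probabilities with confidence rising from 0.5 to 0.99
p1 = np.linspace(0.5, 0.99, 40)
proba = np.column_stack([1.0 - p1, p1])
idx = stratified_confidence_indices(proba, n_instances=8)
print(idx)
```

With a real classifier, proba would come from clf.predict_proba(test.X), so that both confident instances and those near the decision boundary are represented.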

Multi-Dataset Benchmark

To assess how well counterfactual methods generalize, run benchmarks across multiple datasets. This enables statistical testing (e.g., Friedman test) to determine if performance differences are significant across problem domains:

from sklearn.neighbors import KNeighborsClassifier
from tscf_eval.benchmark import (
    BenchmarkRunner, DatasetConfig, ModelConfig, ExplainerConfig,
)
from tscf_eval.counterfactuals import COMTE, NativeGuide
from tscf_eval.data_loader import UCRLoader

# Load datasets and train models
dataset_names = ["ItalyPowerDemand", "GunPoint", "ECG200"]
datasets, model_configs = [], []

for name in dataset_names:
    loader = UCRLoader(name)
    train, test = loader.load("train"), loader.load("test")
    datasets.append(DatasetConfig(name, train.X, train.y, test.X, test.y))

    clf = KNeighborsClassifier(n_neighbors=3)
    clf.fit(train.X, train.y)
    model_configs.append(ModelConfig("knn", clf))

# Run benchmark
runner = BenchmarkRunner(
    datasets=datasets,
    models=model_configs,
    explainers=[
        ExplainerConfig("comte", COMTE, {"distance": "dtw"}),
        ExplainerConfig("ng_blend", NativeGuide, {"method": "blend"}),
    ],
    n_instances=10,
    n_jobs=-1,  # Parallel execution
    verbose=True,
)
results = runner.run()

# Aggregate across datasets
print(results.aggregate(by="explainer"))

Analyzing Results

Counterfactual evaluation is inherently multi-objective: high validity may come at the cost of low proximity, and sparse explanations may sacrifice plausibility. TSCFEval provides tools for principled multi-criteria analysis.

Pareto Analysis

Pareto analysis identifies methods that are not dominated by any other method on the selected metrics. A method is Pareto-optimal if no other method is at least as good on every selected metric and strictly better on at least one. This avoids the need to specify metric weights upfront:

from tscf_eval.benchmark import ParetoAnalyzer

analyzer = ParetoAnalyzer(metrics=[
    "validity_soft", "proximity_dtw", "sparsity",
])

# Find non-dominated methods
pareto_methods = analyzer.pareto_front(results)
print(f"Pareto-optimal: {pareto_methods}")

# Full ranking table
print(analyzer.dominance_ranking(results))

# Export to LaTeX
latex = analyzer.to_latex(results, caption="Results", label="tab:results")
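
The dominance check itself is small enough to write out. The following is a standalone sketch of non-dominated filtering, independent of ParetoAnalyzer: the function name pareto_front_indices, the method names, and the scores are invented for illustration and are not benchmark results.

```python
import numpy as np

def pareto_front_indices(scores, maximize):
    """Return the row indices that no other row dominates (hypothetical helper)."""
    scores = np.asarray(scores, dtype=float)
    # Flip the sign of metrics to minimize so that larger is always better
    signs = np.where(np.asarray(maximize), 1.0, -1.0)
    oriented = scores * signs
    front = []
    for i, row in enumerate(oriented):
        others = np.delete(oriented, i, axis=0)
        # Dominated: some other row is >= everywhere and > somewhere
        at_least_as_good = np.all(others >= row, axis=1)
        strictly_better = np.any(others > row, axis=1)
        if not np.any(at_least_as_good & strictly_better):
            front.append(i)
    return front

methods = ["method_a", "method_b", "method_c"]
# Columns: validity (maximize), DTW proximity (minimize); made-up values
toy_scores = [[1.00, 2.0], [0.90, 1.0], [0.85, 1.5]]
front = pareto_front_indices(toy_scores, maximize=[True, False])
print([methods[i] for i in front])
```

Here method_c is dominated by method_b, while method_a and method_b each win on one metric and therefore both remain on the front.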

Visualizing Pareto Fronts

Pareto front visualizations help understand the trade-offs between metrics. The 2D plot shows which methods lie on the Pareto front (non-dominated solutions) for any pair of metrics. Consistency heatmaps show how often each method appears on the Pareto front across different datasets:

import matplotlib.pyplot as plt

# 2D Pareto front plot
ax = analyzer.plot_front(
    results,
    x_metric="proximity_dtw",
    y_metric="validity_soft",
    annotate=True,
)
plt.savefig("pareto_front.png")

# Cross-dataset consistency heatmap
results_by_dataset = {
    ds: results.filter(datasets=[ds])
    for ds in results.datasets
}
consistency_df = analyzer.consistency(results_by_dataset)
analyzer.plot_consistency_heatmap(consistency_df)
plt.savefig("consistency.png")

Weighted Scalarization

When you need a single ranking of methods, weighted scalarization combines metrics into a composite score. Each metric is min-max normalized to [0, 1] with direction awareness (metrics to maximize are kept as-is, metrics to minimize are inverted so that higher is always better), then combined via a weighted sum. This enables customizable rankings based on your priorities:

from tscf_eval.benchmark import WeightedScalarizer

# Equal weights
scalarizer = WeightedScalarizer(metrics=[
    "validity_soft", "proximity_dtw", "sparsity",
])
print(scalarizer.score(results))

# Custom weights
scalarizer = WeightedScalarizer(
    metrics=["validity_soft", "proximity_dtw", "sparsity"],
    weights={"validity_soft": 3.0, "proximity_dtw": 1.0, "sparsity": 1.0},
)
print(scalarizer.score(results))

# Sensitivity analysis
sens_df = scalarizer.sensitivity(results, vary_metric="validity_soft", n_steps=11)
scalarizer.plot_sensitivity(sens_df)
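
The normalize-invert-sum procedure can be traced on a small table. This is a sketch of the general technique rather than WeightedScalarizer's code: weighted_scores is a hypothetical helper, the weights are assumed to be rescaled to sum to one, and the numbers are made up.

```python
import numpy as np

def weighted_scores(values, maximize, weights):
    """Min-max normalize each column, invert minimize columns, take a weighted sum."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(axis=0), values.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant columns
    normalized = (values - lo) / span
    # After inversion, 1.0 is always the best value in a column
    normalized = np.where(np.asarray(maximize), normalized, 1.0 - normalized)
    w = np.asarray(weights, dtype=float)
    return normalized @ (w / w.sum())

# Rows: three methods. Columns: validity (maximize), DTW proximity (minimize)
toy_values = [[1.0, 2.0], [0.9, 1.0], [0.8, 1.5]]
scores = weighted_scores(toy_values, maximize=[True, False], weights=[3.0, 1.0])
print(scores)
```

Because normalization is relative to the methods being compared, adding or removing a method can change every score, which is one reason to pair a scalarized ranking with a sensitivity analysis.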

Statistical Testing

When benchmarking across multiple datasets, the Friedman test determines if there are statistically significant differences between methods. It’s a non-parametric alternative to repeated-measures ANOVA, ranking methods within each dataset and testing if the average ranks differ significantly:

from tscf_eval.benchmark import friedman_test

fr = friedman_test(results, metric="validity_soft")
print(f"Statistic: {fr.statistic:.3f}, p-value: {fr.p_value:.4f}")
print(fr.rankings)
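
For reference, the same kind of test can be run on a plain score table with SciPy. This sketch shows the underlying statistics only and makes no claim about how friedman_test is implemented; the scores are invented, with rows as datasets and columns as methods.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# Rows are datasets, columns are methods (made-up validity scores)
toy_scores = np.array([
    [0.95, 0.80, 0.60],
    [0.90, 0.85, 0.70],
    [0.99, 0.75, 0.65],
    [0.92, 0.88, 0.50],
])

# Rank methods within each dataset; negate so the highest score gets rank 1
ranks = rankdata(-toy_scores, axis=1)
average_ranks = ranks.mean(axis=0)

# friedmanchisquare expects one sequence of measurements per method
statistic, p_value = friedmanchisquare(*toy_scores.T)
print(f"Average ranks: {average_ranks}")
print(f"Statistic: {statistic:.3f}, p-value: {p_value:.4f}")
```

SciPy's implementation needs at least three methods, and with only a handful of datasets the chi-squared approximation behind the p-value is rough, so results on small benchmarks should be read with caution.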

Extending TSCFEval

TSCFEval is designed to be extensible. You can add your own counterfactual methods and evaluation metrics that integrate seamlessly with the benchmarking framework.

Custom Counterfactual Method

To add a new counterfactual method, inherit from the Counterfactual base class and implement the explain method. The method receives a single instance x and returns a tuple (cf, cf_label, meta):

  • cf: The generated counterfactual (same shape as input)

  • cf_label: The predicted class label for the counterfactual

  • meta: A dictionary with method-specific metadata (e.g., generation parameters)

Here’s an example of a simple interpolation-based method:

import numpy as np
from tscf_eval.counterfactuals import Counterfactual

class MyCounterfactual(Counterfactual):
    """Custom counterfactual using nearest unlike neighbor interpolation."""

    def __init__(self, model, data, n_steps=50):
        self.model = model
        self.X_train, self.y_train = data
        self.n_steps = n_steps

    def explain(self, x, y_pred=None):
        x = np.asarray(x).squeeze()
        if y_pred is None:
            y_pred = int(self.model.predict(x.reshape(1, -1))[0])

        # Find nearest unlike neighbor
        preds = self.model.predict(self.X_train)
        unlike_mask = preds != y_pred
        unlike_samples = self.X_train[unlike_mask]
        if len(unlike_samples) == 0:
            raise ValueError("No training instance is predicted as a different class")
        distances = np.linalg.norm(
            unlike_samples.reshape(len(unlike_samples), -1) - x.flatten(),
            axis=1
        )
        target = unlike_samples[np.argmin(distances)]

        # Interpolate toward target until prediction flips
        cf = x.copy()
        for i in range(self.n_steps):
            alpha = (i + 1) / self.n_steps
            cf = (1 - alpha) * x + alpha * target.squeeze()
            cf_label = int(self.model.predict(cf.reshape(1, -1))[0])
            if cf_label != y_pred:
                break

        meta = {"method": "my_cf", "steps": i + 1, "alpha": alpha}
        return cf, cf_label, meta

# Use in benchmarks
from tscf_eval.benchmark import ExplainerConfig

config = ExplainerConfig("my_method", MyCounterfactual, {"n_steps": 50})

Custom Evaluation Metric

To add a new evaluation metric, inherit from the Metric base class and implement:

  • name(): Returns the metric key used in results dictionaries

  • compute(X, X_cf, **kwargs): Computes and returns the metric value

The compute method receives the original instances X, counterfactuals X_cf, and any additional keyword arguments passed to evaluate() (e.g., model, X_train, y, y_cf). Here’s an example metric that reports the fraction of instances whose largest pointwise change exceeds a threshold:

import numpy as np
from tscf_eval.evaluator import Metric

class MaxChangeMetric(Metric):
    """Fraction of instances where max change exceeds threshold."""

    def __init__(self, threshold=0.1):
        self.threshold = threshold

    def name(self):
        return f"max_change_t{self.threshold}"

    def compute(self, X, X_cf, **kwargs):
        diff = np.abs(np.array(X) - np.array(X_cf))
        max_changes = np.max(diff.reshape(len(X), -1), axis=1)
        return float(np.mean(max_changes > self.threshold))

# Use in evaluator
from tscf_eval.evaluator import Evaluator, Validity

evaluator = Evaluator([
    Validity(),
    MaxChangeMetric(threshold=0.1),
    MaxChangeMetric(threshold=0.5),
])

Complete Workflow

This end-to-end example demonstrates a typical TSCFEval workflow: loading data, training a classifier, running a benchmark, and analyzing the results with Pareto analysis and weighted scalarization. The results are saved to JSON for later analysis or visualization:

import json
from sklearn.neighbors import KNeighborsClassifier
from tscf_eval import UCRLoader
from tscf_eval.counterfactuals import COMTE, NativeGuide
from tscf_eval.evaluator import Evaluator, Validity, Proximity, Sparsity
from tscf_eval.benchmark import (
    BenchmarkRunner, DatasetConfig, ModelConfig, ExplainerConfig,
    ParetoAnalyzer, WeightedScalarizer,
)

# 1. Load data
loader = UCRLoader("ItalyPowerDemand")
train, test = loader.load("train"), loader.load("test")

# 2. Train classifier
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(train.X, train.y)

# 3. Run benchmark
runner = BenchmarkRunner(
    datasets=[DatasetConfig("ItalyPowerDemand", train.X, train.y, test.X, test.y)],
    models=[ModelConfig("knn", clf)],
    explainers=[
        ExplainerConfig("comte", COMTE, {"distance": "dtw"}),
        ExplainerConfig("ng_blend", NativeGuide, {"method": "blend"}),
    ],
    evaluator=Evaluator([Validity(), Proximity(distance="dtw"), Sparsity()]),
    n_instances=20,
    instance_selection="stratified_confidence",
    verbose=True,
)
results = runner.run()

# 4. View results
print(results.to_dataframe())
print(results.aggregate(by="explainer"))

# 5. Pareto analysis
analyzer = ParetoAnalyzer(metrics=["validity_soft", "proximity_dtw", "sparsity"])
print(f"Pareto-optimal: {analyzer.pareto_front(results)}")
print(analyzer.dominance_ranking(results))

# 6. Weighted ranking
scalarizer = WeightedScalarizer(metrics=["validity_soft", "proximity_dtw", "sparsity"])
print(scalarizer.score(results))

# 7. Save results
with open("results.json", "w") as f:
    json.dump(results.to_dict(), f, indent=2)