Benchmark Module

The benchmark module provides a framework for systematic evaluation of counterfactual explanation methods across multiple datasets and classifiers.

Features

Parallel Execution: Run multiple explainers in parallel using the n_jobs parameter
Progress Tracking: Built-in progress bars with tqdm
Evaluation Metrics: 10 built-in metrics across six quality dimensions
Pareto Analysis: Multi-criteria optimization with visualization
Result Export: Save/load results in JSON format

BenchmarkRunner

class tscf_eval.BenchmarkRunner[source]

Bases: object

Orchestrates benchmark execution across datasets, models, and explainers.

Supports parallel execution of independent tasks and provides progress tracking when tqdm is available.

Parameters:

datasets (list[DatasetConfig]) – Datasets to benchmark on.
models (list[ModelConfig]) – Fitted classifier models to use.
explainers (list[ExplainerConfig]) – Explainer configurations to evaluate.
evaluator (Evaluator, optional) – Evaluator with metrics. Defaults to all available metrics.
n_instances (int, optional) – Number of test instances per dataset. None uses all.
instance_selection ({"random", "stratified_confidence"}, default "random") – Strategy for selecting test instances.
- "random": Uniform random sampling.
- "stratified_confidence": Stratified sampling based on model prediction confidence. Divides instances into quantile-based confidence bins and samples from each bin, ensuring coverage of both high-confidence and low-confidence instances.
n_jobs (int, default 1) – Number of parallel jobs. Use -1 for all CPUs. Requires joblib to be installed.
verbose (bool, default True) – Show progress information.
random_state (int, optional) – Random seed for reproducibility when subsampling.

Examples

>>> from tscf_eval.benchmark import (
...     BenchmarkRunner, DatasetConfig, ModelConfig, ExplainerConfig
... )
>>> from tscf_eval import COMTE, NativeGuide
>>>
>>> datasets = [DatasetConfig("DS1", X_tr, y_tr, X_te, y_te)]
>>> models = [ModelConfig("knn", fitted_knn)]
>>> explainers = [
...     ExplainerConfig("comte", COMTE),
...     ExplainerConfig("ng", NativeGuide),
... ]
>>>
>>> runner = BenchmarkRunner(datasets, models, explainers, n_jobs=-1)
>>> results = runner.run()
>>> print(results.summary())

datasets: list[DatasetConfig]

models: list[ModelConfig]

explainers: list[ExplainerConfig]

evaluator: Evaluator | None = None

n_instances: int | None = None

instance_selection: Literal['random', 'stratified_confidence'] = 'random'

n_jobs: int = 1

verbose: bool = True

random_state: int | None = None

__post_init__()[source]

Validate configuration and set up evaluator.

Return type: None

run()[source]

Execute the benchmark.

Return type: BenchmarkResults
Returns: BenchmarkResults – Results container with all evaluation metrics.

__init__(datasets, models, explainers, evaluator=None, n_instances=None, instance_selection='random', n_jobs=1, verbose=True, random_state=None)

Parameters:

datasets (list[DatasetConfig])
models (list[ModelConfig])
explainers (list[ExplainerConfig])
evaluator (Evaluator | None)
n_instances (int | None)
instance_selection (Literal['random', 'stratified_confidence'])
n_jobs (int)
verbose (bool)
random_state (int | None)

Return type:

None
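
The "independent tasks" mentioned above are the (dataset, model, explainer) combinations. The sketch below shows one way such a grid could be enumerated and dispatched to workers. It is an illustration only: `resolve_n_jobs` and `run_task` are hypothetical helpers, and the standard-library thread pool stands in for joblib, which the runner is documented to require for parallel execution.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def resolve_n_jobs(n_jobs: int) -> int:
    """Map n_jobs=-1 to the number of available CPUs (at least 1)."""
    if n_jobs == -1:
        return os.cpu_count() or 1
    return max(1, n_jobs)


def run_task(task: tuple[str, str, str]) -> tuple[tuple[str, str, str], str]:
    """Hypothetical stand-in for evaluating one explainer on one dataset/model."""
    dataset, model, explainer = task
    return task, f"{explainer} on {dataset} with {model}"


datasets = ["DS1", "DS2"]
models = ["knn"]
explainers = ["comte", "ng"]

# Every (dataset, model, explainer) combination is an independent task.
tasks = list(product(datasets, models, explainers))

with ThreadPoolExecutor(max_workers=resolve_n_jobs(2)) as pool:
    outcomes = dict(pool.map(run_task, tasks))

print(len(outcomes))  # 4
```

Because the tasks share no state, the result dictionary is the same regardless of the number of workers.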

Instance Selection

The benchmark supports multiple strategies for selecting test instances via the instance_selection parameter on BenchmarkRunner.

SelectionStrategy

tscf_eval.benchmark.SelectionStrategy: alias of Literal['random', 'stratified_confidence']

Two strategies are available:

"random" (default): Uniform random sampling without replacement.
"stratified_confidence": Instances are grouped into 4 quantile-based confidence bins (25th, 50th, 75th percentiles of max predicted probability), with an equal number of instances sampled from each bin. This ensures coverage of both high-confidence and uncertain instances.

The "stratified_confidence" strategy falls back to random selection, with a warning, if the model does not support predict_proba or if n_instances < 4.

Example:

from tscf_eval.benchmark import BenchmarkRunner

runner = BenchmarkRunner(
    ...,
    instance_selection="stratified_confidence",
)
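
The stratified strategy described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: `stratified_confidence_sample` is a hypothetical helper, the confidences are synthetic, and the handling of bins with too few members is a guess.

```python
import random
import statistics


def stratified_confidence_sample(confidences, n_instances, seed=0):
    """Pick indices by sampling equally from 4 quantile-based confidence bins."""
    # Quartile cut points (25th, 50th, 75th percentiles) of the confidences.
    cuts = statistics.quantiles(confidences, n=4, method="inclusive")
    bins = [[] for _ in range(4)]
    for idx, conf in enumerate(confidences):
        # The number of cut points strictly below conf is the bin index (0..3).
        bins[sum(conf > c for c in cuts)].append(idx)

    rng = random.Random(seed)
    per_bin = n_instances // 4
    selected = []
    for members in bins:
        # Assumption: a bin with too few members contributes what it has.
        selected.extend(rng.sample(members, min(per_bin, len(members))))
    return sorted(selected), bins


# Synthetic max predicted probabilities for 20 test instances.
confidences = [round(0.50 + 0.025 * i, 3) for i in range(20)]
selected, bins = stratified_confidence_sample(confidences, n_instances=8)
print(len(selected))  # 8
```

With 20 evenly spread confidences each bin holds five instances, so two are drawn from each.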

select_instances

tscf_eval.benchmark.selection.select_instances(dataset, model, n_instances, strategy, random_state)[source]

Select test instances according to the given strategy.

Parameters:

dataset (DatasetConfig) – Dataset containing test instances.
model (ModelConfig) – Fitted model (used for confidence-based strategies).
n_instances (int or None) – Number of instances to select. None means use all.
strategy ({"random", "stratified_confidence"}) – Instance selection strategy.
random_state (int or None) – Random seed for reproducibility.

Return type:

tuple[ndarray, ndarray | None, ndarray | None]

Returns:

X_test (np.ndarray) – Selected test instances.
y_test (np.ndarray or None) – Corresponding labels, or None if not available.
bin_indices (np.ndarray or None) – Confidence bin assignment for each selected instance (computed over the full test set), or None when stratified binning was not performed (e.g. random strategy, no predict_proba, or no subsampling).

Parameters:

dataset (DatasetConfig)
model (ModelConfig)
n_instances (int | None)
strategy (SelectionStrategy)
random_state (int | None)

Configuration Classes

ExplainerConfig

class tscf_eval.ExplainerConfig[source]

Bases: object

Configuration for a counterfactual explainer.

Parameters:

name (str) – Unique identifier for this configuration (used in results).
explainer_class (type[Counterfactual]) – The counterfactual explainer class (e.g., COMTE, NativeGuide).
params (dict, optional) – Parameters to pass to the explainer constructor. model and data are provided automatically by the runner.
n_counterfactuals (int, default 1) – Number of counterfactuals to generate per instance. When greater than 1, the explain_k() method is used.

Examples

>>> from tscf_eval import COMTE, NativeGuide
>>> from tscf_eval.benchmark import ExplainerConfig
>>>
>>> configs = [
...     ExplainerConfig("comte_dtw", COMTE, {"distance": "dtw"}),
...     ExplainerConfig("ng_blend", NativeGuide, {"method": "blend"}),
... ]

name: str

explainer_class: type[Counterfactual]

params: dict[str, Any]

n_counterfactuals: int = 1

create_explainer(model, data)[source]

Instantiate the explainer with model and training data.

Parameters:

model (Any) – Fitted classifier.
data (tuple[np.ndarray, np.ndarray]) – Tuple of (X_train, y_train).

Return type:

Counterfactual

Returns:

Counterfactual – Configured explainer instance.

Parameters:

model (Any)
data (tuple[ndarray, ndarray])
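
The sketch below illustrates how a configuration object might combine user-supplied params with the model and data injected by the runner. `ToyExplainer` and `ToyExplainerConfig` are hypothetical, and passing model and data as keyword arguments is an assumption; the real call convention may differ.

```python
from dataclasses import dataclass, field
from typing import Any


class ToyExplainer:
    """Hypothetical explainer; only its constructor signature matters here."""

    def __init__(self, model: Any, data: tuple, distance: str = "euclidean"):
        self.model = model
        self.data = data
        self.distance = distance


@dataclass
class ToyExplainerConfig:
    name: str
    explainer_class: type
    params: dict[str, Any] = field(default_factory=dict)

    def create_explainer(self, model: Any, data: tuple) -> Any:
        # Assumed call convention: model and data by keyword, then user params.
        return self.explainer_class(model=model, data=data, **self.params)


config = ToyExplainerConfig("toy_dtw", ToyExplainer, {"distance": "dtw"})
explainer = config.create_explainer(model="fitted-model", data=([[0.0]], [0]))
print(explainer.distance)  # dtw
```

Keeping model and data out of params lets the same configuration be reused across every dataset/model pair.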

__init__(name, explainer_class, params=<factory>, n_counterfactuals=1)

Parameters:

name (str)
explainer_class (type[Counterfactual])
params (dict[str, Any])
n_counterfactuals (int)

Return type:

None

DatasetConfig

class tscf_eval.DatasetConfig[source]

Bases: object

Configuration for a dataset in benchmarks.

Parameters:

name (str) – Unique identifier for this dataset.
X_train (np.ndarray) – Training features, shape (n_train, series_length) or (n_train, n_channels, series_length).
y_train (np.ndarray) – Training labels, shape (n_train,).
X_test (np.ndarray) – Test features, same shape convention as X_train.
y_test (np.ndarray, optional) – Test labels, shape (n_test,). Required for some metrics.

Examples

>>> from tscf_eval.benchmark import DatasetConfig
>>>
>>> dataset = DatasetConfig(
...     name="GunPoint",
...     X_train=X_train,
...     y_train=y_train,
...     X_test=X_test,
...     y_test=y_test,
... )

name: str

X_train: ndarray

y_train: ndarray

X_test: ndarray

y_test: ndarray | None = None

__post_init__()[source]

Validate and coerce array fields to numpy arrays.

Return type: None

property n_train: int

Number of training instances.

Returns: int – Length of X_train along axis 0.

property n_test: int

Number of test instances.

Returns: int – Length of X_test along axis 0.

property series_length: int

Return the length of each time series.

Returns: int – Last dimension of X_train.

__init__(name, X_train, y_train, X_test, y_test=None)

Parameters:

name (str)
X_train (ndarray)
y_train (ndarray)
X_test (ndarray)
y_test (ndarray | None)

Return type:

None

ModelConfig

class tscf_eval.ModelConfig[source]

Bases: object

Configuration for a classifier model in benchmarks.

The model must be pre-fitted before being passed to the benchmark.

Parameters:

name (str) – Unique identifier for this model (e.g., "knn_dtw", "rocket").
model (Any) – Pre-trained classifier with a predict() method. Some metrics additionally require predict_proba().

Examples

>>> from sklearn.neighbors import KNeighborsClassifier
>>> from tscf_eval.benchmark import ModelConfig
>>>
>>> knn = KNeighborsClassifier(n_neighbors=1)
>>> knn.fit(X_train, y_train)
>>>
>>> model_config = ModelConfig("knn", knn)

name: str

model: Any

predict(X)[source]

Predict class labels for samples.

Parameters: X (np.ndarray) – Input instances, shape (n_samples, ...).
Return type: ndarray
Returns: np.ndarray – Predicted class labels, shape (n_samples,).
Parameters: X (ndarray)

predict_proba(X)[source]

Predict class probabilities if available.

Parameters: X (np.ndarray) – Input instances, shape (n_samples, ...).
Return type: ndarray | None
Returns: np.ndarray or None – Class probabilities of shape (n_samples, n_classes), or None if the model lacks predict_proba.
Parameters: X (ndarray)
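
The "None if the model lacks predict_proba" contract can be sketched as follows. `ToyModelConfig`, `LabelsOnly`, and `WithProba` are hypothetical classes used only to show the pattern; the actual wrapper may be implemented differently.

```python
from typing import Any, Optional


class ToyModelConfig:
    """Hypothetical wrapper mirroring the documented predict_proba contract."""

    def __init__(self, name: str, model: Any):
        self.name = name
        self.model = model

    def predict_proba(self, X: list) -> Optional[list]:
        # Return None instead of raising when the model has no predict_proba.
        proba = getattr(self.model, "predict_proba", None)
        return proba(X) if callable(proba) else None


class LabelsOnly:
    def predict(self, X):
        return [0 for _ in X]


class WithProba(LabelsOnly):
    def predict_proba(self, X):
        return [[0.8, 0.2] for _ in X]


print(ToyModelConfig("a", LabelsOnly()).predict_proba([[1.0]]))  # None
print(ToyModelConfig("b", WithProba()).predict_proba([[1.0]]))   # [[0.8, 0.2]]
```

Callers can then branch on None, which is how a confidence-based selection strategy could decide to fall back to random sampling.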

__init__(name, model)

Parameters:

name (str)
model (Any)

Return type:

None

Result Classes

ExplainerResult

class tscf_eval.ExplainerResult[source]

Bases: object

Results for a single explainer on a single dataset/model combination.

Parameters:

explainer_name (str) – Name of the explainer configuration.
dataset_name (str) – Name of the dataset.
model_name (str) – Name of the model.
X_cf (np.ndarray) – Generated counterfactuals, shape (n_instances, ...) or (n_instances, k, ...) if k > 1.
y_cf (np.ndarray) – Predicted labels for counterfactuals.
success_mask (np.ndarray) – Boolean mask indicating successful generations.
metrics (dict[str, Any]) – Evaluation metrics computed by the Evaluator.
generation_times (list[float]) – Per-instance generation times in seconds.
metadata (list[dict]) – Per-instance metadata from the explainer.

explainer_name: str

dataset_name: str

model_name: str

X_cf: ndarray

y_cf: ndarray

success_mask: ndarray

metrics: dict[str, Any]

generation_times: list[float]

metadata: list[dict[str, Any]]

property n_instances: int

Return the number of test instances.

Returns: int – Length of X_cf along axis 0.

property n_successful: int

Return the number of successfully generated counterfactuals.

Returns: int – Count of True values in success_mask.

property success_rate: float

Return the fraction of successful generations.

Returns: float – Ratio n_successful / n_instances, in [0, 1].

property mean_time: float

Return the mean generation time per instance in seconds.

Returns: float – Average of generation_times, or 0.0 if empty.

property total_time: float

Return the total generation time in seconds.

Returns: float – Sum of generation_times.
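
The derived properties above reduce to a few lines of arithmetic. The sketch below uses a hypothetical `ToyResult` with plain lists; it takes n_instances from the mask length rather than from X_cf, and its behavior for an empty mask is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ToyResult:
    success_mask: list[bool]
    generation_times: list[float]

    @property
    def n_instances(self) -> int:
        return len(self.success_mask)

    @property
    def n_successful(self) -> int:
        return sum(self.success_mask)

    @property
    def success_rate(self) -> float:
        # Assumption: an empty result reports a rate of 0.0.
        return self.n_successful / self.n_instances if self.n_instances else 0.0

    @property
    def mean_time(self) -> float:
        # Documented behavior: 0.0 when no timings were recorded.
        if not self.generation_times:
            return 0.0
        return sum(self.generation_times) / len(self.generation_times)

    @property
    def total_time(self) -> float:
        return sum(self.generation_times)


result = ToyResult([True, True, False, True], [0.5, 1.0, 1.5, 1.0])
print(result.success_rate, result.mean_time, result.total_time)  # 0.75 1.0 4.0
```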

get_metric(name, default=None)[source]

Get a metric value by name, with optional default.

Parameters:

name (str) – Metric key to look up in self.metrics.
default (Any, default None) – Value to return if the metric is not found.

Return type:

Any

Returns:

Any – The metric value, or default if not present.

Parameters:

name (str)
default (Any)

__init__(explainer_name, dataset_name, model_name, X_cf, y_cf, success_mask, metrics, generation_times, metadata)

Parameters:

explainer_name (str)
dataset_name (str)
model_name (str)
X_cf (ndarray)
y_cf (ndarray)
success_mask (ndarray)
metrics (dict[str, Any])
generation_times (list[float])
metadata (list[dict[str, Any]])

Return type:

None

BenchmarkResults

class tscf_eval.BenchmarkResults[source]

Bases: object

Container for all benchmark results with analysis methods.

Stores results indexed by (dataset, model, explainer) combinations. Provides methods for querying, aggregating, and exporting results.

Examples

>>> results = runner.run()
>>>
>>> # Get specific result
>>> result = results.get("GunPoint", "knn", "comte")
>>>
>>> # Get comparison DataFrame
>>> df = results.to_dataframe()
>>>
>>> # Iterate over results
>>> for result in results:
...     print(f"{result.dataset_name}/{result.model_name}: {result.metrics}")

add(result)[source]

Add a result to the collection.

Parameters: result (ExplainerResult) – Result to store, keyed by (dataset, model, explainer).
Return type: None
Parameters: result (ExplainerResult)

get(dataset, model, explainer)[source]

Get the result for a specific (dataset, model, explainer) combination.

Parameters:

dataset (str) – Dataset name.
model (str) – Model name.
explainer (str) – Explainer name.

Return type:

ExplainerResult | None

Returns:

ExplainerResult or None – The matching result, or None if not found.

Parameters:

dataset (str)
model (str)
explainer (str)

__iter__()[source]

Iterate over all stored results.

Return type: Iterator[ExplainerResult]
Returns: Iterator[ExplainerResult] – Iterator yielding each result.

__len__()[source]

Return the number of stored results.

Return type: int
Returns: int – Total number of (dataset, model, explainer) entries.

property datasets: list[str]

Return the sorted list of unique dataset names.

Returns: list[str] – Unique dataset names across all stored results.

property models: list[str]

Return the sorted list of unique model names.

Returns: list[str] – Unique model names across all stored results.

property explainers: list[str]

Return the sorted list of unique explainer names.

Returns: list[str] – Unique explainer names across all stored results.

filter(datasets=None, models=None, explainers=None)[source]

Create a filtered copy of results.

Parameters:

datasets (list[str], optional) – Filter to these datasets. None means all.
models (list[str], optional) – Filter to these models. None means all.
explainers (list[str], optional) – Filter to these explainers. None means all.

Return type:

BenchmarkResults

Returns:

BenchmarkResults – New results containing only matching entries.

Parameters:

datasets (list[str] | None)
models (list[str] | None)
explainers (list[str] | None)
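
Storage keyed by (dataset, model, explainer) makes lookup and filtering straightforward. The sketch below uses a plain dictionary and a hypothetical `filter_results` helper to show the documented semantics, where None means "no restriction"; it is not the class's actual implementation.

```python
results = {
    ("GunPoint", "knn", "comte"): {"validity": 1.0},
    ("GunPoint", "knn", "ng"): {"validity": 0.9},
    ("ECG200", "knn", "comte"): {"validity": 0.8},
}


def filter_results(results, datasets=None, models=None, explainers=None):
    """Keep entries whose key matches every filter that is not None."""

    def keep(value, allowed):
        return allowed is None or value in allowed

    return {
        key: value
        for key, value in results.items()
        if keep(key[0], datasets) and keep(key[1], models) and keep(key[2], explainers)
    }


subset = filter_results(results, datasets=["GunPoint"], explainers=["comte"])
print(list(subset))                          # [('GunPoint', 'knn', 'comte')]
print(results.get(("ECG200", "knn", "ng")))  # None
```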

to_dataframe(metrics=None, include_timing=True)[source]

Convert results to a pandas DataFrame.

Parameters:

metrics (list[str], optional) – Specific metrics to include. None means all available.
include_timing (bool, default True) – Include timing columns (mean_time, total_time).

Return type:

DataFrame

Returns:

pd.DataFrame – DataFrame with columns for dataset, model, explainer, and metrics.

Parameters:

metrics (list[str] | None)
include_timing (bool)

aggregate(by='explainer', metrics=None, aggfunc='mean')[source]

Aggregate metrics across a dimension.

Parameters:

by (str, default "explainer") – Dimension to group by: "explainer", "model", or "dataset".
metrics (list[str], optional) – Metrics to aggregate. None means all numeric.
aggfunc (str or list[str], default "mean") – Aggregation function(s): "mean", "median", "std", "min", "max". When a list is provided (e.g. ["mean", "std"]), the returned DataFrame has a MultiIndex on columns with levels (metric, aggfunc).

Return type:

DataFrame

Returns:

pd.DataFrame – Aggregated results.

Parameters:

by (str)
metrics (list[str] | None)
aggfunc (str | list[str])
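
The grouping and the (metric, aggfunc) column layout can be illustrated without pandas. In this sketch a hypothetical `aggregate` function returns nested dictionaries whose inner keys are (metric, aggfunc) tuples, mirroring the two-level column index described above; the data and the use of the sample standard deviation for "std" are assumptions.

```python
import statistics
from collections import defaultdict

rows = [
    {"explainer": "comte", "dataset": "DS1", "validity": 1.0},
    {"explainer": "comte", "dataset": "DS2", "validity": 0.8},
    {"explainer": "ng", "dataset": "DS1", "validity": 0.6},
    {"explainer": "ng", "dataset": "DS2", "validity": 0.8},
]

AGGFUNCS = {
    "mean": statistics.mean,
    "median": statistics.median,
    "std": statistics.stdev,  # sample standard deviation (assumption)
    "min": min,
    "max": max,
}


def aggregate(rows, by, metrics, aggfuncs):
    """Group rows by one dimension and apply each aggfunc to each metric."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for metric in metrics:
            grouped[row[by]][metric].append(row[metric])
    return {
        group: {
            (metric, name): AGGFUNCS[name](values)
            for metric, values in per_metric.items()
            for name in aggfuncs
        }
        for group, per_metric in grouped.items()
    }


table = aggregate(rows, by="explainer", metrics=["validity"], aggfuncs=["mean", "max"])
print(table["comte"][("validity", "max")])  # 1.0
```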

summary()[source]

Get summary statistics aggregated by explainer.

Return type: DataFrame
Returns: pd.DataFrame – Summary with mean metrics per explainer across all datasets/models.

to_dict()[source]

Convert to nested dictionary for serialization.

Return type: dict[str, Any]

classmethod from_dict(data)[source]

Reconstruct from a dictionary produced by to_dict().

Only metrics and metadata are restored; counterfactual arrays (X_cf, y_cf) are not stored by to_dict and will be set to empty arrays.

Parameters: data (dict) – Dictionary as returned by to_dict or loaded from JSON.
Return type: BenchmarkResults
Returns: BenchmarkResults
Parameters: data (dict[str, Any])
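
The round trip through JSON, including the loss of the counterfactual arrays, can be sketched with plain dictionaries. `result_to_dict` and `result_from_dict` are hypothetical helpers, and the field layout is an assumption; only the behavior (metrics survive, arrays come back empty) follows the description above.

```python
import json


def result_to_dict(result: dict) -> dict:
    """Keep only JSON-friendly fields; drop the counterfactual arrays."""
    return {k: v for k, v in result.items() if k not in ("X_cf", "y_cf")}


def result_from_dict(data: dict) -> dict:
    """Restore a result, substituting empty lists for the dropped arrays."""
    return {**data, "X_cf": [], "y_cf": []}


original = {
    "explainer_name": "comte",
    "dataset_name": "GunPoint",
    "model_name": "knn",
    "X_cf": [[0.1, 0.2]],
    "y_cf": [1],
    "metrics": {"validity": 1.0},
    "generation_times": [0.5],
}

payload = json.dumps(result_to_dict(original))
restored = result_from_dict(json.loads(payload))
print(restored["metrics"], restored["X_cf"])  # {'validity': 1.0} []
```

Anything that needs the counterfactuals themselves therefore has to be computed before exporting.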

__init__(_results=<factory>)

Parameters: _results (dict[tuple[str, str, str], ExplainerResult])
Return type: None

Pareto Analysis

ParetoAnalyzer

Multi-criteria Pareto analysis with visualization support.

class tscf_eval.ParetoAnalyzer[source]

Bases: object

Multi-objective Pareto dominance analysis.

Analyzes benchmark results to identify Pareto-optimal solutions and compute dominance rankings. Includes plotting utilities for Pareto fronts and cross-dataset consistency analysis.

Parameters:

metrics (list[str]) – Metric names to use for Pareto analysis.
directions (dict[str, Literal["min", "max"]], optional) – Override metric directions. Keys are metric names, values are "min" or "max".

Examples

>>> from tscf_eval.benchmark import ParetoAnalyzer
>>>
>>> analyzer = ParetoAnalyzer(["validity", "proximity_l2", "mean_time_s"])
>>> ranking = analyzer.dominance_ranking(results)
>>> pareto_optimal = analyzer.pareto_front(results)

metrics: list[str]

directions: dict[str, Literal['min', 'max']]

__post_init__()[source]

Validate that at least one metric is provided.

Return type: None

pareto_front(results, aggregate_by='explainer')[source]

Find Pareto-optimal (non-dominated) solutions.

Parameters:

results (BenchmarkResults) – Benchmark results to analyze.
aggregate_by (str, default "explainer") – Dimension to aggregate by.

Return type:

list[str]

Returns:

list[str] – Names of Pareto-optimal solutions.

Parameters:

results (BenchmarkResults)
aggregate_by (str)

dominance_count(results, aggregate_by='explainer')[source]

Count how many solutions each solution dominates.

Parameters:

results (BenchmarkResults) – Benchmark results to analyze.
aggregate_by (str, default "explainer") – Dimension to aggregate by.

Return type:

dict[str, int]

Returns:

dict[str, int] – Mapping from name to number of dominated solutions.

Parameters:

results (BenchmarkResults)
aggregate_by (str)

dominated_by_count(results, aggregate_by='explainer')[source]

Count how many solutions dominate each solution.

Lower is better (0 means Pareto-optimal).

Parameters:

results (BenchmarkResults) – Benchmark results to analyze.
aggregate_by (str, default "explainer") – Dimension to aggregate by.

Return type:

dict[str, int]

Returns:

dict[str, int] – Mapping from name to number of dominating solutions.

Parameters:

results (BenchmarkResults)
aggregate_by (str)
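
The dominance relation behind pareto_front(), dominance_count(), and dominated_by_count() can be sketched directly. The scores and directions below are made up for illustration, and `dominates` is a hypothetical helper rather than the analyzer's own code.

```python
def dominates(a, b, directions):
    """True if a is at least as good as b on every metric and better on one."""
    at_least_as_good = True
    strictly_better = False
    for metric, direction in directions.items():
        # Flip the sign of "min" metrics so that larger is always better.
        sign = 1.0 if direction == "max" else -1.0
        va, vb = sign * a[metric], sign * b[metric]
        if va < vb:
            at_least_as_good = False
        elif va > vb:
            strictly_better = True
    return at_least_as_good and strictly_better


scores = {
    "comte": {"validity": 1.0, "proximity_l2": 2.0},
    "ng": {"validity": 0.9, "proximity_l2": 1.0},
    "baseline": {"validity": 0.8, "proximity_l2": 2.5},
}
directions = {"validity": "max", "proximity_l2": "min"}

# For each solution, count how many other solutions dominate it.
dominated_by = {
    name: sum(
        dominates(other, point, directions)
        for other_name, other in scores.items()
        if other_name != name
    )
    for name, point in scores.items()
}
pareto_front = sorted(name for name, count in dominated_by.items() if count == 0)

print(dominated_by)  # {'comte': 0, 'ng': 0, 'baseline': 2}
print(pareto_front)  # ['comte', 'ng']
```

Here comte and ng trade validity against proximity, so neither dominates the other, while baseline is worse than both on every metric.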

dominance_ranking(results, aggregate_by='explainer')[source]

Compute dominance ranking table.

Parameters:

results (BenchmarkResults) – Benchmark results to analyze.
aggregate_by (str, default "explainer") – Dimension to aggregate by.

Return type:

DataFrame

Returns:

pd.DataFrame – DataFrame with columns name, dominated_by, dominates, pareto, plus one column per metric.

Parameters:

results (BenchmarkResults)
aggregate_by (str)

to_dataframe(results, aggregate_by='explainer')[source]

Get DataFrame with metric values and Pareto status.

Alias for dominance_ranking().

Parameters:

results (BenchmarkResults) – Benchmark results to analyze.
aggregate_by (str, default "explainer") – Dimension to aggregate by.

Return type:

DataFrame

Returns:

pd.DataFrame – DataFrame with columns name, dominated_by, dominates, pareto, plus one column per metric.

Parameters:

results (BenchmarkResults)
aggregate_by (str)

plot_front(results, x_metric, y_metric, aggregate_by='explainer', ax=None, annotate=True, pareto_color='tab:blue', other_color='grey', pareto_marker='o', other_marker='x', title=None)[source]

Plot a 2-D Pareto front scatter.

Parameters:

results (BenchmarkResults) – Benchmark results to analyze.
x_metric, y_metric (str) – Metrics for the x and y axes.
aggregate_by (str, default "explainer") – Dimension to aggregate by.
ax (matplotlib Axes, optional) – Axes to plot on. Created if None.
annotate (bool, default True) – Label each point with the entity name.
pareto_color (str, default "tab:blue") – Color for Pareto-optimal points.
other_color (str, default "grey") – Color for dominated points.
pareto_marker (str, default "o") – Marker for Pareto-optimal points.
other_marker (str, default "x") – Marker for dominated points.
title (str, optional) – Plot title. Defaults to "Pareto Front".

Return type:

Axes

Returns:

matplotlib.axes.Axes

Parameters:

results (BenchmarkResults)
x_metric (str)
y_metric (str)
aggregate_by (str)
ax (Axes | None)
annotate (bool)
pareto_color (str)
other_color (str)
pareto_marker (str)
other_marker (str)
title (str | None)

consistency(results_dict, aggregate_by='explainer')[source]

Compute cross-dataset Pareto consistency matrix.

For each dataset, identifies the Pareto-optimal solutions and returns a boolean matrix (entity × dataset) with a count column.

Parameters:

results_dict (dict[str, BenchmarkResults]) – Mapping from dataset/group name to its benchmark results.
aggregate_by (str, default "explainer") – Dimension to aggregate by.

Return type:

DataFrame

Returns:

pd.DataFrame – Boolean DataFrame with entities as rows, datasets as columns, plus a count column. Sorted by count descending.

Parameters:

results_dict (dict[str, BenchmarkResults])
aggregate_by (str)
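
The shape of the consistency matrix can be illustrated with precomputed per-dataset fronts. The fronts below are invented, and `consistency_matrix` is a hypothetical helper that returns a list of row dictionaries instead of a DataFrame.

```python
fronts = {
    "GunPoint": ["comte", "ng"],
    "ECG200": ["comte"],
    "Coffee": ["comte", "baseline"],
}
entities = ["baseline", "comte", "ng"]


def consistency_matrix(fronts, entities):
    """One row per entity: Pareto membership per dataset plus a count."""
    rows = []
    for entity in entities:
        row = {"name": entity}
        for dataset, front in fronts.items():
            row[dataset] = entity in front
        row["count"] = sum(row[dataset] for dataset in fronts)
        rows.append(row)
    # sorted() is stable, so entities with equal counts keep their input order.
    return sorted(rows, key=lambda r: r["count"], reverse=True)


matrix = consistency_matrix(fronts, entities)
print([(row["name"], row["count"]) for row in matrix])
# [('comte', 3), ('baseline', 1), ('ng', 1)]
```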

plot_consistency_heatmap(consistency_df, ax=None, cmap='YlGn', title=None)[source]

Plot Pareto consistency as a heatmap.

Parameters:

consistency_df (pd.DataFrame) – Output of consistency().
ax (matplotlib Axes, optional) – Axes to plot on. Created if None.
cmap (str, default "YlGn") – Matplotlib colormap name.
title (str, optional) – Plot title.

Return type:

Axes

Returns:

matplotlib.axes.Axes

Parameters:

consistency_df (DataFrame)
ax (Axes | None)
cmap (str)
title (str | None)

to_latex(results, aggregate_by='explainer', precision=3, caption=None, label=None)[source]

Generate a LaTeX table of the dominance ranking.

Parameters:

results (BenchmarkResults) – Benchmark results to analyze.
aggregate_by (str, default "explainer") – Dimension to aggregate by.
precision (int, default 3) – Number of decimal places.
caption, label (str, optional) – LaTeX caption and label.

Return type:

str

Returns:

str – LaTeX table source code.

Parameters:

results (BenchmarkResults)
aggregate_by (str)
precision (int)
caption (str | None)
label (str | None)

__init__(metrics, directions=<factory>)

Parameters:

metrics (list[str])
directions (dict[str, Literal['min', 'max']])

Return type:

None

Visualization and Analysis Methods

ParetoAnalyzer provides these methods:

Method	Description
`pareto_front()`	Identify Pareto-optimal (non-dominated) solutions
`dominance_ranking()`	Full dominance ranking table with metric values
`plot_front()`	2D scatter plot showing Pareto front between two objectives
`consistency()`	Cross-dataset Pareto consistency matrix
`plot_consistency_heatmap()`	Heatmap of Pareto consistency across datasets
`to_latex()`	Generate LaTeX table of the dominance ranking

Examples:

from tscf_eval.benchmark import ParetoAnalyzer
import matplotlib.pyplot as plt

# Create analyzer with metric names (directions are inferred)
analyzer = ParetoAnalyzer(metrics=[
    "validity_soft", "proximity_dtw", "sparsity", "efficiency_time_s",
])

# Identify Pareto-optimal methods
pareto_methods = analyzer.pareto_front(results)
print(f"Pareto-optimal: {pareto_methods}")

# Full dominance ranking table
ranking = analyzer.dominance_ranking(results)
print(ranking)

# 2D Pareto front plot
ax = analyzer.plot_front(
    results,
    x_metric="proximity_dtw",
    y_metric="validity_soft",
    annotate=True,
)
plt.savefig("pareto_front.png")

# Cross-dataset consistency analysis
results_by_dataset = {
    ds: results.filter(datasets=[ds])
    for ds in results.datasets
}
consistency_df = analyzer.consistency(results_by_dataset)
analyzer.plot_consistency_heatmap(consistency_df)
plt.savefig("consistency_heatmap.png")

# Export to LaTeX
latex = analyzer.to_latex(results, caption="Pareto Ranking", label="tab:pareto")

Weighted Scalarization

WeightedScalarizer

Min-max normalized weighted composite scoring for ranking methods.

class tscf_eval.WeightedScalarizer[source]

Bases: object

Min-max normalized weighted composite scoring.

Normalizes each metric to [0, 1] via min-max scaling (respecting metric directions so that higher normalized values are always better), then computes a weighted sum.

Parameters:

metrics (list[str]) – Metric names to include in the composite score.
weights (dict[str, float], optional) – Per-metric weights. Automatically normalized to sum to 1. If None, all metrics are weighted equally.
directions (dict[str, Literal["min", "max"]], optional) – Override metric directions.

Examples

>>> scalarizer = WeightedScalarizer(
...     ["validity", "proximity_l2", "sparsity"],
...     weights={"validity": 2.0, "proximity_l2": 1.0, "sparsity": 1.0},
... )
>>> scores = scalarizer.score(results)

metrics: list[str]

weights: dict[str, float]

directions: dict[str, Literal['min', 'max']]

__post_init__()[source]

Validate metrics and normalize weights to sum to 1.

Return type: None

score(results, aggregate_by='explainer')[source]

Compute weighted composite scores.

Parameters:

results (BenchmarkResults) – Benchmark results to analyze.
aggregate_by (str, default "explainer") – Dimension to aggregate by.

Return type:

DataFrame

Returns:

pd.DataFrame – DataFrame with normalized metric columns plus composite. Sorted by composite descending.

Parameters:

results (BenchmarkResults)
aggregate_by (str)
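
The scoring scheme (direction-aware min-max normalization followed by a weighted sum) can be sketched as follows. The metric values are invented, the helpers are hypothetical, and the treatment of a constant metric is an assumption of this sketch.

```python
def min_max_normalize(values, direction):
    """Scale one metric to [0, 1] so that 1.0 is always the best entry."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        # Assumption of this sketch: a constant metric maps to 1.0 everywhere.
        return {name: 1.0 for name in values}
    if direction == "max":
        return {name: (v - lo) / (hi - lo) for name, v in values.items()}
    return {name: (hi - v) / (hi - lo) for name, v in values.items()}


def composite_scores(table, weights, directions):
    """Weighted sum of normalized metrics, sorted best first."""
    total = sum(weights.values())
    scores = {}
    for metric, values in table.items():
        share = weights[metric] / total  # weights are rescaled to sum to 1
        for name, norm in min_max_normalize(values, directions[metric]).items():
            scores[name] = scores.get(name, 0.0) + share * norm
    return dict(sorted(scores.items(), key=lambda item: item[1], reverse=True))


table = {
    "validity": {"comte": 1.0, "ng": 0.5, "baseline": 0.0},
    "proximity_l2": {"comte": 4.0, "ng": 2.0, "baseline": 0.0},
}
weights = {"validity": 3.0, "proximity_l2": 1.0}
directions = {"validity": "max", "proximity_l2": "min"}

composite = composite_scores(table, weights, directions)
print(composite)  # {'comte': 0.75, 'ng': 0.5, 'baseline': 0.25}
```

Because min-max scaling is relative to the entities being compared, adding or removing an entity can change every composite score.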

sensitivity(results, vary_metric, n_steps=11, aggregate_by='explainer')[source]

Sensitivity analysis by sweeping one metric’s weight.

Varies the weight of vary_metric from 0 to 1, redistributing the remaining weight proportionally among the other metrics.

Parameters:

results (BenchmarkResults) – Benchmark results to analyze.
vary_metric (str) – The metric whose weight to sweep.
n_steps (int, default 11) – Number of weight values (0 to 1 inclusive).
aggregate_by (str, default "explainer") – Dimension to aggregate by.

Return type:

DataFrame

Returns:

pd.DataFrame – Long-format DataFrame with columns weight, <aggregate_by>, composite.

Parameters:

results (BenchmarkResults)
vary_metric (str)
n_steps (int)
aggregate_by (str)
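
The weight sweep can be sketched independently of the scoring step. `sweep_weights` is a hypothetical generator that follows the description above: the varied metric's weight moves from 0 to 1 and the remainder is split among the other metrics in proportion to their base weights. It assumes n_steps is at least 2 and that the other base weights do not sum to zero.

```python
def sweep_weights(base_weights, vary_metric, n_steps=11):
    """Yield weight dicts as vary_metric's weight moves from 0 to 1."""
    others = {m: w for m, w in base_weights.items() if m != vary_metric}
    others_total = sum(others.values())
    for step in range(n_steps):
        w = step / (n_steps - 1)
        weights = {vary_metric: w}
        for metric, base in others.items():
            # The remaining mass (1 - w) is split in proportion to base weights.
            weights[metric] = (1.0 - w) * base / others_total
        yield weights


base = {"validity": 2.0, "proximity_l2": 3.0, "sparsity": 1.0}
grid = list(sweep_weights(base, vary_metric="validity", n_steps=5))
print(grid[2])  # {'validity': 0.5, 'proximity_l2': 0.375, 'sparsity': 0.125}
```

Each weight dictionary in the grid sums to 1, so composite scores remain comparable across steps.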

plot_sensitivity(sensitivity_df, aggregate_by='explainer', ax=None, title=None)[source]

Plot sensitivity analysis results.

Parameters:

sensitivity_df (pd.DataFrame) – Output of sensitivity().
aggregate_by (str, default "explainer") – Column name for the entity dimension.
ax (matplotlib Axes, optional) – Axes to plot on. Created if None.
title (str, optional) – Plot title.

Return type:

Axes

Returns:

matplotlib.axes.Axes

Parameters:

sensitivity_df (DataFrame)
aggregate_by (str)
ax (Axes | None)
title (str | None)

to_latex(results, aggregate_by='explainer', precision=3, caption=None, label=None)[source]

Generate a LaTeX table of weighted scores.

Parameters:

results (BenchmarkResults) – Benchmark results to analyze.
aggregate_by (str, default "explainer") – Dimension to aggregate by.
precision (int, default 3) – Number of decimal places.
caption, label (str, optional) – LaTeX caption and label.

Return type:

str

Returns:

str – LaTeX table source code.

Parameters:

results (BenchmarkResults)
aggregate_by (str)
precision (int)
caption (str | None)
label (str | None)

__init__(metrics, weights=<factory>, directions=<factory>)

Parameters:

metrics (list[str])
weights (dict[str, float])
directions (dict[str, Literal['min', 'max']])

Return type:

None

Example:

from tscf_eval.benchmark import WeightedScalarizer

# Equal-weight composite across metrics
scalarizer = WeightedScalarizer(metrics=[
    "validity_soft", "proximity_dtw", "sparsity",
])
scores = scalarizer.score(results)

# Custom weights emphasizing validity
scalarizer = WeightedScalarizer(
    metrics=["validity_soft", "proximity_dtw", "sparsity"],
    weights={"validity_soft": 3.0, "proximity_dtw": 1.0, "sparsity": 1.0},
)

# Sensitivity analysis
sens_df = scalarizer.sensitivity(results, vary_metric="validity_soft", n_steps=11)
scalarizer.plot_sensitivity(sens_df)

Statistical Testing

friedman_test

tscf_eval.benchmark.friedman_test(results, metric, aggregate_by='explainer', group_by='dataset')[source]

Run a Friedman test comparing explainers across groups.

The Friedman test is a non-parametric test for detecting differences in treatments across multiple groups (e.g., explainers across datasets).

Parameters:

results (BenchmarkResults) – Benchmark results to analyze.
metric (str) – Metric name to test.
aggregate_by (str, default "explainer") – Treatments to compare (columns of the rank matrix).
group_by (str, default "dataset") – Blocking factor (rows of the rank matrix).

Return type:

FriedmanResult

Returns:

FriedmanResult – Named tuple with statistic, p_value, and rankings.

Raises:

ImportError – If scipy is not installed.
ValueError – If there are fewer than 3 treatments or fewer than 2 groups.

Parameters:

results (BenchmarkResults)
metric (str)
aggregate_by (str)
group_by (str)

Example:

from tscf_eval.benchmark import friedman_test

fr = friedman_test(results, metric="validity_soft")
print(f"Statistic: {fr.statistic:.3f}, p-value: {fr.p_value:.4f}")
print(fr.rankings)

FriedmanResult

A NamedTuple with three fields:

statistic (float) – Friedman chi-squared statistic.
p_value (float) – p-value of the test.
rankings (pd.DataFrame) – Mean rank of each treatment (e.g. explainer) for the tested metric.
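
To make the statistic and the mean ranks concrete, the sketch below computes the textbook Friedman statistic in plain Python. It is an illustration, not the function's implementation: the data are invented, higher values are assumed to be better, and the sketch omits both the tie correction and the p-value that a full implementation such as scipy.stats.friedmanchisquare provides.

```python
def rank_row(values, higher_is_better=True):
    """Rank one block; rank 1 is best. Ties receive the average rank."""
    order = sorted(values, key=values.get, reverse=higher_is_better)
    ranks = {}
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for name in order[i : j + 1]:
            ranks[name] = average
        i = j + 1
    return ranks


def friedman_statistic(blocks):
    """Return (statistic, mean_ranks) for n blocks and k treatments."""
    n = len(blocks)
    treatments = list(next(iter(blocks.values())))
    k = len(treatments)
    rank_sums = {t: 0.0 for t in treatments}
    for values in blocks.values():
        for name, rank in rank_row(values).items():
            rank_sums[name] += rank
    # Q = 12 / (n k (k + 1)) * sum(R_j^2) - 3 n (k + 1), with R_j the rank sums.
    squares = sum(r * r for r in rank_sums.values())
    statistic = 12.0 / (n * k * (k + 1)) * squares - 3.0 * n * (k + 1)
    return statistic, {t: rank_sums[t] / n for t in treatments}


# Rows are datasets (blocks); columns are explainers (treatments).
blocks = {
    "DS1": {"comte": 0.9, "ng": 0.8, "baseline": 0.5},
    "DS2": {"comte": 0.95, "ng": 0.7, "baseline": 0.6},
    "DS3": {"comte": 0.85, "ng": 0.8, "baseline": 0.4},
    "DS4": {"comte": 0.99, "ng": 0.9, "baseline": 0.7},
}
statistic, mean_ranks = friedman_statistic(blocks)
print(statistic)   # 8.0
print(mean_ranks)  # {'comte': 1.0, 'ng': 2.0, 'baseline': 3.0}
```

With three treatments ranked identically in all four blocks, the statistic reaches its maximum of n(k - 1) = 8.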

LaTeX Table Generation

tscf_eval.benchmark.format_latex_table(df, directions=None, bold_best=True, arrows=True, precision=3, midrule_every=0, escape_underscores=True, caption=None, label=None)[source]

Format a DataFrame as a LaTeX table with best-value highlighting.

Parameters:

df (pd.DataFrame) – DataFrame to format. First column is typically the entity name (explainer/model), remaining columns are metrics.
directions (dict[str, bool], optional) – Mapping of column name to True if higher is better. If None, uses the built-in metric direction registry.
bold_best (bool, default True) – Bold the best value in each numeric column.
arrows (bool, default True) – Append directional arrows to column headers.
precision (int, default 3) – Number of decimal places for floats.
midrule_every (int, default 0) – Insert \midrule every N data rows. 0 means no midrules.
escape_underscores (bool, default True) – Replace _ with \_ in column headers and string cells.
caption (str, optional) – LaTeX table caption.
label (str, optional) – LaTeX table label for cross-referencing.

Return type:

str

Returns:

str – LaTeX table source code.

Parameters:

df (DataFrame)
directions (dict[str, bool] | None)
bold_best (bool)
arrows (bool)
precision (int)
midrule_every (int)
escape_underscores (bool)
caption (str | None)
label (str | None)
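
The two formatting steps that most affect the output, bolding the best value and escaping underscores, can be sketched as below. `bold_best_column` and `escape_underscores` are hypothetical helpers that show the idea for a single column; they are not the function's internals.

```python
def bold_best_column(values, higher_is_better, precision=3):
    """Format one numeric column, wrapping the best entry in \\textbf{}."""
    best = max(values) if higher_is_better else min(values)
    cells = []
    for value in values:
        text = f"{value:.{precision}f}"
        cells.append(rf"\textbf{{{text}}}" if value == best else text)
    return cells


def escape_underscores(text):
    """Replace _ with \\_ so the text is safe in LaTeX text mode."""
    return text.replace("_", r"\_")


header = escape_underscores("proximity_l2")
cells = bold_best_column([0.9, 0.75, 0.6], higher_is_better=False)
print(header)  # proximity\_l2
print(cells)   # ['0.900', '0.750', '\\textbf{0.600}']
```

The direction flag matters: for a distance-like metric the smallest value is bolded, which is why a directions mapping or registry is needed.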