Benchmark Module

The benchmark module provides a framework for systematic evaluation of counterfactual explanation methods across multiple datasets and classifiers.

Features

  • Parallel Execution: Run multiple explainers in parallel using the n_jobs parameter

  • Progress Tracking: Built-in progress bars with tqdm

  • Evaluation Metrics: 10 built-in metrics across six quality dimensions

  • Pareto Analysis: Multi-criteria optimization with visualization

  • Result Export: Save/load results in JSON format

BenchmarkRunner

class tscf_eval.BenchmarkRunner[source]

Bases: object

Orchestrates benchmark execution across datasets, models, and explainers.

Supports parallel execution of independent tasks and provides progress tracking when tqdm is available.

Parameters:
  • datasets (list[DatasetConfig]) – Datasets to benchmark on.

  • models (list[ModelConfig]) – Fitted classifier models to use.

  • explainers (list[ExplainerConfig]) – Explainer configurations to evaluate.

  • evaluator (Evaluator, optional) – Evaluator with metrics. Defaults to all available metrics.

  • n_instances (int, optional) – Number of test instances per dataset. None uses all.

  • instance_selection ({"random", "stratified_confidence"}, default "random") – Strategy for selecting test instances.

    • "random": Uniform random sampling.

    • "stratified_confidence": Stratified sampling based on model prediction confidence. Divides instances into quantile-based confidence bins and samples from each bin, ensuring coverage of both high-confidence and low-confidence instances.

  • n_jobs (int, default 1) – Number of parallel jobs. Use -1 for all CPUs. Requires joblib to be installed.

  • verbose (bool, default True) – Show progress information.

  • random_state (int, optional) – Random seed for reproducibility when subsampling.
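
As a rough illustration of what "independent tasks" means here, the sketch below enumerates every (dataset, model, explainer) combination and evaluates the combinations concurrently. It is not the runner's actual implementation: it uses the standard library's concurrent.futures as a stand-in for joblib, and run_task and the name lists are hypothetical placeholders.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Hypothetical stand-ins for the configured names; not part of tscf_eval.
dataset_names = ["DS1", "DS2"]
model_names = ["knn"]
explainer_names = ["comte", "ng"]


def run_task(task):
    """Placeholder for evaluating one explainer on one dataset/model pair."""
    dataset, model, explainer = task
    return task, f"evaluated {explainer} on {dataset}/{model}"


# Each combination is independent of the others, so they can run concurrently.
tasks = list(product(dataset_names, model_names, explainer_names))
with ThreadPoolExecutor(max_workers=4) as pool:
    outcomes = dict(pool.map(run_task, tasks))

print(len(outcomes))  # 2 datasets x 1 model x 2 explainers = 4 tasks
```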

Examples

>>> from tscf_eval.benchmark import (
...     BenchmarkRunner, DatasetConfig, ModelConfig, ExplainerConfig
... )
>>> from tscf_eval import COMTE, NativeGuide
>>>
>>> datasets = [DatasetConfig("DS1", X_tr, y_tr, X_te, y_te)]
>>> models = [ModelConfig("knn", fitted_knn)]
>>> explainers = [
...     ExplainerConfig("comte", COMTE),
...     ExplainerConfig("ng", NativeGuide),
... ]
>>>
>>> runner = BenchmarkRunner(datasets, models, explainers, n_jobs=-1)
>>> results = runner.run()
>>> print(results.summary())
datasets: list[DatasetConfig]
models: list[ModelConfig]
explainers: list[ExplainerConfig]
evaluator: Evaluator | None = None
n_instances: int | None = None
instance_selection: Literal['random', 'stratified_confidence'] = 'random'
n_jobs: int = 1
verbose: bool = True
random_state: int | None = None
__post_init__()[source]

Validate configuration and set up evaluator.

Return type:

None

run()[source]

Execute the benchmark.

Return type:

BenchmarkResults

Returns:

BenchmarkResults – Results container with all evaluation metrics.

__init__(datasets, models, explainers, evaluator=None, n_instances=None, instance_selection='random', n_jobs=1, verbose=True, random_state=None)
Return type:

None

Instance Selection

The benchmark supports multiple strategies for selecting test instances via the instance_selection parameter on BenchmarkRunner.

SelectionStrategy

tscf_eval.benchmark.SelectionStrategy

alias of Literal['random', 'stratified_confidence']

Two strategies are available:

  • "random" (default): Uniform random sampling without replacement.

  • "stratified_confidence": Instances are grouped into 4 quantile-based confidence bins (25th, 50th, 75th percentiles of max predicted probability), with an equal number of instances sampled from each bin. This ensures coverage of both high-confidence and uncertain instances.

If the model does not support predict_proba, or if n_instances < 4, this strategy falls back to random selection and emits a warning.
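
The stratified strategy described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library's code: the function name and its seed argument are hypothetical, any remainder of n_instances / 4 is ignored, and the fallback to random selection is not shown.

```python
import random
from statistics import quantiles


def stratified_confidence_sample(confidences, n_instances, seed=0):
    """Pick indices so that each confidence quartile is equally represented.

    `confidences` holds the max predicted probability per test instance.
    """
    rng = random.Random(seed)
    # Cut points at the 25th, 50th and 75th percentiles.
    q25, q50, q75 = quantiles(confidences, n=4)
    bins = [[], [], [], []]
    for idx, conf in enumerate(confidences):
        # Count how many cut points the value exceeds to get a bin id in 0..3.
        bin_id = (conf > q25) + (conf > q50) + (conf > q75)
        bins[bin_id].append(idx)
    per_bin = n_instances // 4  # remainder is ignored in this sketch
    selected = []
    for members in bins:
        selected.extend(rng.sample(members, min(per_bin, len(members))))
    return sorted(selected)


confidences = [0.50 + 0.005 * i for i in range(100)]  # synthetic confidences
chosen = stratified_confidence_sample(confidences, n_instances=8)
print(chosen)
```

With 100 evenly spaced confidences, each bin holds 25 instances and two indices are drawn from each.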

Example:

from tscf_eval.benchmark import BenchmarkRunner

runner = BenchmarkRunner(
    ...,
    instance_selection="stratified_confidence",
)

select_instances

tscf_eval.benchmark.selection.select_instances(dataset, model, n_instances, strategy, random_state)[source]

Select test instances according to the given strategy.

Parameters:
  • dataset (DatasetConfig) – Dataset containing test instances.

  • model (ModelConfig) – Fitted model (used for confidence-based strategies).

  • n_instances (int or None) – Number of instances to select. None means use all.

  • strategy ({"random", "stratified_confidence"}) – Instance selection strategy.

  • random_state (int or None) – Random seed for reproducibility.

Return type:

tuple[ndarray, ndarray | None, ndarray | None]

Returns:

  • X_test (np.ndarray) – Selected test instances.

  • y_test (np.ndarray or None) – Corresponding labels, or None if not available.

  • bin_indices (np.ndarray or None) – Confidence bin assignment for each selected instance (computed over the full test set), or None when stratified binning was not performed (e.g. random strategy, no predict_proba, or no subsampling).

Configuration Classes

ExplainerConfig

class tscf_eval.ExplainerConfig[source]

Bases: object

Configuration for a counterfactual explainer.

Parameters:
  • name (str) – Unique identifier for this configuration (used in results).

  • explainer_class (type[Counterfactual]) – The counterfactual explainer class (e.g., COMTE, NativeGuide).

  • params (dict, optional) – Parameters to pass to the explainer constructor. model and data are provided automatically by the runner.

  • n_counterfactuals (int, default 1) – Number of counterfactuals to generate per instance. When > 1, uses explain_k() method.

Examples

>>> from tscf_eval import COMTE, NativeGuide
>>> from tscf_eval.benchmark import ExplainerConfig
>>>
>>> configs = [
...     ExplainerConfig("comte_dtw", COMTE, {"distance": "dtw"}),
...     ExplainerConfig("ng_blend", NativeGuide, {"method": "blend"}),
... ]
name: str
explainer_class: type[Counterfactual]
params: dict[str, Any]
n_counterfactuals: int = 1
create_explainer(model, data)[source]

Instantiate the explainer with model and training data.

Parameters:
  • model (Any) – Fitted classifier.

  • data (tuple[np.ndarray, np.ndarray]) – Tuple of (X_train, y_train).

Return type:

Counterfactual

Returns:

Counterfactual – Configured explainer instance.
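
A minimal sketch of how a configuration object might combine user-supplied params with the model and data injected by the runner. SketchExplainerConfig and DummyExplainer are hypothetical stand-ins, and passing model and data as keyword arguments is an assumption rather than confirmed library behaviour.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SketchExplainerConfig:
    """Illustrative stand-in for ExplainerConfig (not the library class)."""

    name: str
    explainer_class: type
    params: dict[str, Any] = field(default_factory=dict)

    def create_explainer(self, model, data):
        # Assumption: model and data are passed as keyword arguments
        # alongside the user-supplied constructor parameters.
        return self.explainer_class(model=model, data=data, **self.params)


class DummyExplainer:
    """Hypothetical explainer used only to exercise the sketch."""

    def __init__(self, model, data, distance="euclidean"):
        self.model = model
        self.data = data
        self.distance = distance


config = SketchExplainerConfig("dummy_dtw", DummyExplainer, {"distance": "dtw"})
explainer = config.create_explainer(model="fitted-model", data=([[0.0]], [0]))
print(explainer.distance)  # dtw
```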

__init__(name, explainer_class, params=<factory>, n_counterfactuals=1)
Return type:

None

DatasetConfig

class tscf_eval.DatasetConfig[source]

Bases: object

Configuration for a dataset in benchmarks.

Parameters:
  • name (str) – Unique identifier for this dataset.

  • X_train (np.ndarray) – Training features, shape (n_train, series_length) or (n_train, n_channels, series_length).

  • y_train (np.ndarray) – Training labels, shape (n_train,).

  • X_test (np.ndarray) – Test features, same shape convention as X_train.

  • y_test (np.ndarray, optional) – Test labels, shape (n_test,). Required for some metrics.

Examples

>>> from tscf_eval.benchmark import DatasetConfig
>>>
>>> dataset = DatasetConfig(
...     name="GunPoint",
...     X_train=X_train,
...     y_train=y_train,
...     X_test=X_test,
...     y_test=y_test,
... )
name: str
X_train: ndarray
y_train: ndarray
X_test: ndarray
y_test: ndarray | None = None
__post_init__()[source]

Validate and coerce array fields to numpy arrays.

Return type:

None

property n_train: int

Number of training instances.

Returns:

int – Length of X_train along axis 0.

property n_test: int

Number of test instances.

Returns:

int – Length of X_test along axis 0.

property series_length: int

Return the length of each time series.

Returns:

int – Last dimension of X_train.

__init__(name, X_train, y_train, X_test, y_test=None)
Return type:

None

ModelConfig

class tscf_eval.ModelConfig[source]

Bases: object

Configuration for a classifier model in benchmarks.

The model must be pre-fitted before being passed to the benchmark.

Parameters:
  • name (str) – Unique identifier for this model (e.g., "knn_dtw", "rocket").

  • model (Any) – Pre-trained classifier with predict() method. Should also have predict_proba() for some metrics.

Examples

>>> from sklearn.neighbors import KNeighborsClassifier
>>> from tscf_eval.benchmark import ModelConfig
>>>
>>> knn = KNeighborsClassifier(n_neighbors=1)
>>> knn.fit(X_train, y_train)
>>>
>>> model_config = ModelConfig("knn", knn)
name: str
model: Any
predict(X)[source]

Predict class labels for samples.

Parameters:

X (np.ndarray) – Input instances, shape (n_samples, ...).

Return type:

ndarray

Returns:

np.ndarray – Predicted class labels, shape (n_samples,).

predict_proba(X)[source]

Predict class probabilities if available.

Parameters:

X (np.ndarray) – Input instances, shape (n_samples, ...).

Return type:

ndarray | None

Returns:

np.ndarray or None – Class probabilities of shape (n_samples, n_classes), or None if the model lacks predict_proba.
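
The "return None when predict_proba is missing" behaviour can be pictured with a small wrapper. The classes below are hypothetical and only mirror the documented contract, not the library source.

```python
class SketchModelConfig:
    """Illustrative wrapper; mirrors the documented behaviour, not the source."""

    def __init__(self, name, model):
        self.name = name
        self.model = model

    def predict_proba(self, X):
        # Return None instead of raising when the model has no predict_proba.
        proba_fn = getattr(self.model, "predict_proba", None)
        return proba_fn(X) if callable(proba_fn) else None


class LabelOnlyModel:
    """Hypothetical classifier exposing predict() only."""

    def predict(self, X):
        return [0 for _ in X]


class ProbabilisticModel(LabelOnlyModel):
    """Hypothetical classifier that also reports class probabilities."""

    def predict_proba(self, X):
        return [[0.8, 0.2] for _ in X]


X = [[0.1, 0.2], [0.3, 0.4]]
hard = SketchModelConfig("hard", LabelOnlyModel()).predict_proba(X)
soft = SketchModelConfig("soft", ProbabilisticModel()).predict_proba(X)
print(hard)  # None
print(soft)
```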

__init__(name, model)
Return type:

None

Result Classes

ExplainerResult

class tscf_eval.ExplainerResult[source]

Bases: object

Results for a single explainer on a single dataset/model combination.

Parameters:
  • explainer_name (str) – Name of the explainer configuration.

  • dataset_name (str) – Name of the dataset.

  • model_name (str) – Name of the model.

  • X_cf (np.ndarray) – Generated counterfactuals, shape (n_instances, ...) or (n_instances, k, ...) if k > 1.

  • y_cf (np.ndarray) – Predicted labels for counterfactuals.

  • success_mask (np.ndarray) – Boolean mask indicating successful generations.

  • metrics (dict[str, Any]) – Evaluation metrics computed by the Evaluator.

  • generation_times (list[float]) – Per-instance generation times in seconds.

  • metadata (list[dict]) – Per-instance metadata from the explainer.

explainer_name: str
dataset_name: str
model_name: str
X_cf: ndarray
y_cf: ndarray
success_mask: ndarray
metrics: dict[str, Any]
generation_times: list[float]
metadata: list[dict[str, Any]]
property n_instances: int

Return the number of test instances.

Returns:

int – Length of X_cf along axis 0.

property n_successful: int

Return the number of successfully generated counterfactuals.

Returns:

int – Count of True values in success_mask.

property success_rate: float

Return the fraction of successful generations.

Returns:

float – Ratio n_successful / n_instances, in [0, 1].

property mean_time: float

Return the mean generation time per instance in seconds.

Returns:

float – Average of generation_times, or 0.0 if empty.

property total_time: float

Return the total generation time in seconds.

Returns:

float – Sum of generation_times.
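
The derived properties above reduce to simple arithmetic on success_mask and generation_times. The values below are made up for illustration, and the instance count is taken from the mask length here, whereas the documented property uses the length of X_cf.

```python
success_mask = [True, True, False, True]       # hypothetical outcomes
generation_times = [0.12, 0.30, 0.05, 0.13]    # seconds per instance

# Assumption: the mask has one entry per test instance.
n_instances = len(success_mask)
n_successful = sum(success_mask)
success_rate = n_successful / n_instances
total_time = sum(generation_times)
# The documented fallback for an empty timing list is 0.0.
mean_time = total_time / len(generation_times) if generation_times else 0.0

print(n_successful, success_rate)  # 3 0.75
```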

get_metric(name, default=None)[source]

Get a metric value by name, with optional default.

Parameters:
  • name (str) – Metric key to look up in self.metrics.

  • default (Any, default None) – Value to return if the metric is not found.

Return type:

Any

Returns:

Any – The metric value, or default if not present.

__init__(explainer_name, dataset_name, model_name, X_cf, y_cf, success_mask, metrics, generation_times, metadata)
Return type:

None

BenchmarkResults

class tscf_eval.BenchmarkResults[source]

Bases: object

Container for all benchmark results with analysis methods.

Stores results indexed by (dataset, model, explainer) combinations. Provides methods for querying, aggregating, and exporting results.

Examples

>>> results = runner.run()
>>>
>>> # Get specific result
>>> result = results.get("GunPoint", "knn", "comte")
>>>
>>> # Get comparison DataFrame
>>> df = results.to_dataframe()
>>>
>>> # Iterate over results
>>> for result in results:
...     print(f"{result.dataset_name}/{result.model_name}: {result.metrics}")
add(result)[source]

Add a result to the collection.

Parameters:

result (ExplainerResult) – Result to store, keyed by (dataset, model, explainer).

Return type:

None

get(dataset, model, explainer)[source]

Get the result for a specific (dataset, model, explainer) combination.

Parameters:
  • dataset (str) – Dataset name.

  • model (str) – Model name.

  • explainer (str) – Explainer name.

Return type:

ExplainerResult | None

Returns:

ExplainerResult or None – The matching result, or None if not found.

__iter__()[source]

Iterate over all stored results.

Return type:

Iterator[ExplainerResult]

Returns:

Iterator[ExplainerResult] – Iterator yielding each result.

__len__()[source]

Return the number of stored results.

Return type:

int

Returns:

int – Total number of (dataset, model, explainer) entries.

property datasets: list[str]

Return the sorted list of unique dataset names.

Returns:

list[str] – Unique dataset names across all stored results.

property models: list[str]

Return the sorted list of unique model names.

Returns:

list[str] – Unique model names across all stored results.

property explainers: list[str]

Return the sorted list of unique explainer names.

Returns:

list[str] – Unique explainer names across all stored results.

filter(datasets=None, models=None, explainers=None)[source]

Create a filtered copy of results.

Parameters:
  • datasets (list[str], optional) – Filter to these datasets. None means all.

  • models (list[str], optional) – Filter to these models. None means all.

  • explainers (list[str], optional) – Filter to these explainers. None means all.

Return type:

BenchmarkResults

Returns:

BenchmarkResults – New results containing only matching entries.
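
Because results are indexed by (dataset, model, explainer), filtering amounts to matching each part of the key against an optional allow-list. The following is a sketch on a plain dict with hypothetical contents, not the class's internal code.

```python
# Hypothetical store keyed by (dataset, model, explainer); values are metrics.
store = {
    ("GunPoint", "knn", "comte"): {"validity": 1.0},
    ("GunPoint", "knn", "ng"): {"validity": 0.9},
    ("ECG200", "knn", "comte"): {"validity": 0.8},
}


def filter_results(store, datasets=None, models=None, explainers=None):
    """Return a new dict with only the matching entries; None matches all."""
    wanted = (datasets, models, explainers)
    return {
        key: value
        for key, value in store.items()
        if all(allowed is None or part in allowed
               for part, allowed in zip(key, wanted))
    }


subset = filter_results(store, datasets=["GunPoint"], explainers=["comte"])
print(list(subset))  # [('GunPoint', 'knn', 'comte')]
```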

to_dataframe(metrics=None, include_timing=True)[source]

Convert results to a pandas DataFrame.

Parameters:
  • metrics (list[str], optional) – Specific metrics to include. None means all available.

  • include_timing (bool, default True) – Include timing columns (mean_time, total_time).

Return type:

DataFrame

Returns:

pd.DataFrame – DataFrame with columns for dataset, model, explainer, and metrics.

aggregate(by='explainer', metrics=None, aggfunc='mean')[source]

Aggregate metrics across a dimension.

Parameters:
  • by (str, default "explainer") – Dimension to group by: "explainer", "model", or "dataset".

  • metrics (list[str], optional) – Metrics to aggregate. None means all numeric.

  • aggfunc (str or list[str], default "mean") – Aggregation function(s): "mean", "median", "std", "min", "max". When a list is provided (e.g. ["mean", "std"]), the returned DataFrame has a MultiIndex on columns with levels (metric, aggfunc).

Return type:

DataFrame

Returns:

pd.DataFrame – Aggregated results.
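
To make the grouping and the (metric, aggfunc) column levels concrete, here is a standard-library sketch on hypothetical rows. The real method returns a pandas DataFrame; this version returns nested dicts keyed by (metric, aggfunc) pairs and supports only two aggregation functions.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical flat rows, one per (dataset, model, explainer) result.
rows = [
    {"dataset": "A", "explainer": "comte", "validity": 1.0},
    {"dataset": "B", "explainer": "comte", "validity": 0.8},
    {"dataset": "A", "explainer": "ng", "validity": 0.6},
    {"dataset": "B", "explainer": "ng", "validity": 0.9},
]
AGGFUNCS = {"mean": mean, "std": stdev}  # sample standard deviation


def aggregate(rows, by, metric, aggfuncs):
    """Group rows by one dimension and apply each aggregation function."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[by]].append(row[metric])
    # Keys are (metric, aggfunc) pairs, mirroring a two-level column index.
    return {
        name: {(metric, fn): AGGFUNCS[fn](values) for fn in aggfuncs}
        for name, values in groups.items()
    }


table = aggregate(rows, by="explainer", metric="validity", aggfuncs=["mean", "std"])
print(table["comte"][("validity", "mean")])
```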

summary()[source]

Get summary statistics aggregated by explainer.

Return type:

DataFrame

Returns:

pd.DataFrame – Summary with mean metrics per explainer across all datasets/models.

to_dict()[source]

Convert to nested dictionary for serialization.

Return type:

dict[str, Any]

classmethod from_dict(data)[source]

Reconstruct from a dictionary produced by to_dict().

Only metrics and metadata are restored; counterfactual arrays (X_cf, y_cf) are not stored by to_dict and will be set to empty arrays.

Parameters:

data (dict) – Dictionary as returned by to_dict or loaded from JSON.

Return type:

BenchmarkResults

Returns:

BenchmarkResults

__init__(_results=<factory>)
Parameters:

_results (dict[tuple[str, str, str], ExplainerResult])

Return type:

None

Pareto Analysis

ParetoAnalyzer

Multi-criteria Pareto analysis with visualization support.

class tscf_eval.ParetoAnalyzer[source]

Bases: object

Multi-objective Pareto dominance analysis.

Analyzes benchmark results to identify Pareto-optimal solutions and compute dominance rankings. Includes plotting utilities for Pareto fronts and cross-dataset consistency analysis.

Parameters:
  • metrics (list[str]) – Metric names to use for Pareto analysis.

  • directions (dict[str, Literal["min", "max"]], optional) – Override metric directions. Keys are metric names, values are "min" or "max".

Examples

>>> from tscf_eval.benchmark import ParetoAnalyzer
>>>
>>> analyzer = ParetoAnalyzer(["validity", "proximity_l2", "mean_time_s"])
>>> ranking = analyzer.dominance_ranking(results)
>>> pareto_optimal = analyzer.pareto_front(results)
metrics: list[str]
directions: dict[str, Literal['min', 'max']]
__post_init__()[source]

Validate that at least one metric is provided.

Return type:

None

pareto_front(results, aggregate_by='explainer')[source]

Find Pareto-optimal (non-dominated) solutions.

Parameters:
  • results (BenchmarkResults) – Benchmark results to analyze.

  • aggregate_by (str, default "explainer") – Dimension to aggregate by.

Return type:

list[str]

Returns:

list[str] – Names of Pareto-optimal solutions.
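
For readers unfamiliar with Pareto dominance, the following self-contained sketch shows the standard definition: a solution is dominated if another is at least as good on every metric and strictly better on at least one. The function names and scores are illustrative only and do not reproduce the analyzer's code.

```python
def dominates(a, b, directions):
    """True if `a` is at least as good as `b` everywhere and better somewhere."""
    at_least_as_good = True
    strictly_better = False
    for metric, direction in directions.items():
        # Flip the sign of "min" metrics so that larger is always better.
        sign = 1.0 if direction == "max" else -1.0
        va, vb = sign * a[metric], sign * b[metric]
        if va < vb:
            at_least_as_good = False
        elif va > vb:
            strictly_better = True
    return at_least_as_good and strictly_better


def pareto_front(scores, directions):
    """Names of solutions that no other solution dominates."""
    return [
        name
        for name, vals in scores.items()
        if not any(
            dominates(other, vals, directions)
            for other_name, other in scores.items()
            if other_name != name
        )
    ]


# Hypothetical per-explainer averages.
directions = {"validity": "max", "proximity_l2": "min"}
scores = {
    "comte": {"validity": 0.95, "proximity_l2": 2.0},
    "ng": {"validity": 0.90, "proximity_l2": 1.2},
    "baseline": {"validity": 0.85, "proximity_l2": 2.5},
}
print(pareto_front(scores, directions))  # ['comte', 'ng']
```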

dominance_count(results, aggregate_by='explainer')[source]

Count how many solutions each solution dominates.

Parameters:
  • results (BenchmarkResults) – Benchmark results to analyze.

  • aggregate_by (str, default "explainer") – Dimension to aggregate by.

Return type:

dict[str, int]

Returns:

dict[str, int] – Mapping from name to number of dominated solutions.

dominated_by_count(results, aggregate_by='explainer')[source]

Count how many solutions dominate each solution.

Lower is better (0 means Pareto-optimal).

Parameters:
  • results (BenchmarkResults) – Benchmark results to analyze.

  • aggregate_by (str, default "explainer") – Dimension to aggregate by.

Return type:

dict[str, int]

Returns:

dict[str, int] – Mapping from name to number of dominating solutions.
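
The two counts are mirror images of each other, as the sketch below shows on hypothetical (validity, proximity) pairs. The dominates helper and the tuple layout are assumptions made for brevity.

```python
def dominates(a, b, maximize):
    """`a` dominates `b`: no worse on every metric, strictly better on one."""
    signed = [(x, y) if up else (-x, -y) for x, y, up in zip(a, b, maximize)]
    return all(x >= y for x, y in signed) and any(x > y for x, y in signed)


# Hypothetical (validity, proximity) pairs; validity is maximized,
# proximity is minimized.
maximize = (True, False)
scores = {
    "comte": (0.95, 2.0),
    "ng": (0.90, 1.2),
    "baseline": (0.85, 2.5),
    "random": (0.60, 3.0),
}

# How many other solutions each solution dominates.
dominance_count = {
    name: sum(dominates(vals, other, maximize)
              for other_name, other in scores.items() if other_name != name)
    for name, vals in scores.items()
}
# How many other solutions dominate each solution (0 means Pareto-optimal).
dominated_by_count = {
    name: sum(dominates(other, vals, maximize)
              for other_name, other in scores.items() if other_name != name)
    for name, vals in scores.items()
}
print(dominance_count)
print(dominated_by_count)
```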

dominance_ranking(results, aggregate_by='explainer')[source]

Compute dominance ranking table.

Parameters:
  • results (BenchmarkResults) – Benchmark results to analyze.

  • aggregate_by (str, default "explainer") – Dimension to aggregate by.

Return type:

DataFrame

Returns:

pd.DataFrame – DataFrame with columns name, dominated_by, dominates, pareto, plus one column per metric.

to_dataframe(results, aggregate_by='explainer')[source]

Get DataFrame with metric values and Pareto status.

Alias for dominance_ranking().

Parameters:
  • results (BenchmarkResults) – Benchmark results to analyze.

  • aggregate_by (str, default "explainer") – Dimension to aggregate by.

Return type:

DataFrame

Returns:

pd.DataFrame – DataFrame with columns name, dominated_by, dominates, pareto, plus one column per metric.

plot_front(results, x_metric, y_metric, aggregate_by='explainer', ax=None, annotate=True, pareto_color='tab:blue', other_color='grey', pareto_marker='o', other_marker='x', title=None)[source]

Plot a 2-D Pareto front scatter.

Parameters:
  • results (BenchmarkResults) – Benchmark results to analyze.

  • x_metric, y_metric (str) – Metrics for the x and y axes.

  • aggregate_by (str, default "explainer") – Dimension to aggregate by.

  • ax (matplotlib Axes, optional) – Axes to plot on. Created if None.

  • annotate (bool, default True) – Label each point with the entity name.

  • pareto_color (str, default "tab:blue") – Color for Pareto-optimal points.

  • other_color (str, default "grey") – Color for dominated points.

  • pareto_marker (str, default "o") – Marker for Pareto-optimal points.

  • other_marker (str, default "x") – Marker for dominated points.

  • title (str, optional) – Plot title. Defaults to "Pareto Front".

Return type:

Axes

Returns:

matplotlib.axes.Axes

consistency(results_dict, aggregate_by='explainer')[source]

Compute cross-dataset Pareto consistency matrix.

For each dataset identifies the Pareto-optimal solutions and returns a boolean matrix (entity x dataset) with a count column.

Parameters:
  • results_dict (dict[str, BenchmarkResults]) – Mapping from dataset/group name to its benchmark results.

  • aggregate_by (str, default "explainer") – Dimension to aggregate by.

Return type:

DataFrame

Returns:

pd.DataFrame – Boolean DataFrame with entities as rows, datasets as columns, plus a count column. Sorted by count descending.
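
The shape of the consistency matrix can be sketched without pandas. The per-dataset fronts below are hypothetical inputs assumed to have been computed already; each row holds one membership flag per dataset plus the total count.

```python
# Hypothetical Pareto fronts already computed per dataset.
fronts = {
    "GunPoint": ["comte", "ng"],
    "ECG200": ["ng"],
    "Coffee": ["ng", "baseline"],
}
entities = ["comte", "ng", "baseline"]

# One row per entity: membership flag per dataset plus a total count.
matrix = []
for entity in entities:
    row = {ds: entity in front for ds, front in fronts.items()}
    row["count"] = sum(row.values())
    matrix.append((entity, row))

# Entities that are Pareto-optimal on more datasets come first.
matrix.sort(key=lambda item: item[1]["count"], reverse=True)
for entity, row in matrix:
    print(entity, row)
```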

plot_consistency_heatmap(consistency_df, ax=None, cmap='YlGn', title=None)[source]

Plot Pareto consistency as a heatmap.

Parameters:
  • consistency_df (pd.DataFrame) – Output of consistency().

  • ax (matplotlib Axes, optional) – Axes to plot on. Created if None.

  • cmap (str, default "YlGn") – Matplotlib colormap name.

  • title (str, optional) – Plot title.

Return type:

Axes

Returns:

matplotlib.axes.Axes

to_latex(results, aggregate_by='explainer', precision=3, caption=None, label=None)[source]

Generate a LaTeX table of the dominance ranking.

Parameters:
  • results (BenchmarkResults) – Benchmark results to analyze.

  • aggregate_by (str, default "explainer") – Dimension to aggregate by.

  • precision (int, default 3) – Number of decimal places.

  • caption, label (str, optional) – LaTeX caption and label.

Return type:

str

Returns:

str – LaTeX table source code.

__init__(metrics, directions=<factory>)
Return type:

None

Visualization and Analysis Methods

ParetoAnalyzer provides these methods:

Method                      Description
pareto_front()              Identify Pareto-optimal (non-dominated) solutions
dominance_ranking()         Full dominance ranking table with metric values
plot_front()                2D scatter plot showing Pareto front between two objectives
consistency()               Cross-dataset Pareto consistency matrix
plot_consistency_heatmap()  Heatmap of Pareto consistency across datasets
to_latex()                  Generate LaTeX table of the dominance ranking

Examples:

from tscf_eval.benchmark import ParetoAnalyzer
import matplotlib.pyplot as plt

# Create analyzer with metric names (directions are inferred)
analyzer = ParetoAnalyzer(metrics=[
    "validity_soft", "proximity_dtw", "sparsity", "efficiency_time_s",
])

# Identify Pareto-optimal methods
pareto_methods = analyzer.pareto_front(results)
print(f"Pareto-optimal: {pareto_methods}")

# Full dominance ranking table
ranking = analyzer.dominance_ranking(results)
print(ranking)

# 2D Pareto front plot
ax = analyzer.plot_front(
    results,
    x_metric="proximity_dtw",
    y_metric="validity_soft",
    annotate=True,
)
plt.savefig("pareto_front.png")

# Cross-dataset consistency analysis
results_by_dataset = {
    ds: results.filter(datasets=[ds])
    for ds in results.datasets
}
consistency_df = analyzer.consistency(results_by_dataset)
analyzer.plot_consistency_heatmap(consistency_df)
plt.savefig("consistency_heatmap.png")

# Export to LaTeX
latex = analyzer.to_latex(results, caption="Pareto Ranking", label="tab:pareto")

Weighted Scalarization

WeightedScalarizer

Min-max normalized weighted composite scoring for ranking methods.

class tscf_eval.WeightedScalarizer[source]

Bases: object

Min-max normalized weighted composite scoring.

Normalizes each metric to [0, 1] via min-max scaling (respecting metric directions so that higher normalized values are always better), then computes a weighted sum.

Parameters:
  • metrics (list[str]) – Metric names to include in the composite score.

  • weights (dict[str, float], optional) – Per-metric weights. Automatically normalized to sum to 1. If None, all metrics are weighted equally.

  • directions (dict[str, Literal["min", "max"]], optional) – Override metric directions.

Examples

>>> scalarizer = WeightedScalarizer(
...     ["validity", "proximity_l2", "sparsity"],
...     weights={"validity": 2.0, "proximity_l2": 1.0, "sparsity": 1.0},
... )
>>> scores = scalarizer.score(results)
metrics: list[str]
weights: dict[str, float]
directions: dict[str, Literal['min', 'max']]
__post_init__()[source]

Validate metrics and normalize weights to sum to 1.

Return type:

None

score(results, aggregate_by='explainer')[source]

Compute weighted composite scores.

Parameters:
  • results (BenchmarkResults) – Benchmark results to analyze.

  • aggregate_by (str, default "explainer") – Dimension to aggregate by.

Return type:

DataFrame

Returns:

pd.DataFrame – DataFrame with normalized metric columns plus composite. Sorted by composite descending.
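
The scoring scheme (direction-aware min-max normalization followed by a weighted sum) can be written out in a few lines. This is a sketch on hypothetical values; how the library treats a metric whose values are all equal is not documented, so the zero contribution used here is an assumption.

```python
def composite_scores(values, weights, directions):
    """Min-max normalize each metric, then take the weighted sum."""
    total = sum(weights.values())
    norm_weights = {m: w / total for m, w in weights.items()}  # sum to 1
    scores = {name: 0.0 for name in values}
    for metric, weight in norm_weights.items():
        column = [vals[metric] for vals in values.values()]
        lo, hi = min(column), max(column)
        for name, vals in values.items():
            if hi == lo:
                # Assumption: a constant column contributes 0 for every entity.
                scaled = 0.0
            else:
                scaled = (vals[metric] - lo) / (hi - lo)
                if directions[metric] == "min":
                    scaled = 1.0 - scaled  # lower raw value gives higher score
            scores[name] += weight * scaled
    # Highest composite first.
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical per-explainer averages.
values = {
    "comte": {"validity": 1.0, "proximity_l2": 3.0},
    "ng": {"validity": 0.8, "proximity_l2": 1.0},
    "baseline": {"validity": 0.6, "proximity_l2": 2.0},
}
weights = {"validity": 3.0, "proximity_l2": 1.0}
directions = {"validity": "max", "proximity_l2": "min"}
ranked = composite_scores(values, weights, directions)
print(ranked)
```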

sensitivity(results, vary_metric, n_steps=11, aggregate_by='explainer')[source]

Sensitivity analysis by sweeping one metric’s weight.

Varies the weight of vary_metric from 0 to 1, redistributing the remaining weight proportionally among the other metrics.

Parameters:
  • results (BenchmarkResults) – Benchmark results to analyze.

  • vary_metric (str) – The metric whose weight to sweep.

  • n_steps (int, default 11) – Number of weight values (0 to 1 inclusive).

  • aggregate_by (str, default "explainer") – Dimension to aggregate by.

Return type:

DataFrame

Returns:

pd.DataFrame – Long-format DataFrame with columns weight, <aggregate_by>, composite.
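
The weight sweep can be illustrated as follows: the varied metric takes weight w, and the remaining 1 - w is split among the other metrics in proportion to their base weights. The helper name is hypothetical, and the sketch assumes n_steps >= 2 and at least one other metric with non-zero weight.

```python
def sweep_weights(base_weights, vary_metric, n_steps=11):
    """Yield weight dicts with `vary_metric` swept from 0 to 1 inclusive."""
    others = {m: w for m, w in base_weights.items() if m != vary_metric}
    others_total = sum(others.values())
    for step in range(n_steps):
        w = step / (n_steps - 1)
        # The remaining mass (1 - w) keeps the other metrics' relative ratios.
        weights = {m: (1.0 - w) * v / others_total for m, v in others.items()}
        weights[vary_metric] = w
        yield weights


base = {"validity": 0.5, "proximity_l2": 0.25, "sparsity": 0.25}
grid = list(sweep_weights(base, "validity", n_steps=5))
for weights in grid:
    print(weights)
```

Each yielded dict sums to 1, so it can be fed to a composite score unchanged.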

plot_sensitivity(sensitivity_df, aggregate_by='explainer', ax=None, title=None)[source]

Plot sensitivity analysis results.

Parameters:
  • sensitivity_df (pd.DataFrame) – Output of sensitivity().

  • aggregate_by (str, default "explainer") – Column name for the entity dimension.

  • ax (matplotlib Axes, optional) – Axes to plot on. Created if None.

  • title (str, optional) – Plot title.

Return type:

Axes

Returns:

matplotlib.axes.Axes

to_latex(results, aggregate_by='explainer', precision=3, caption=None, label=None)[source]

Generate a LaTeX table of weighted scores.

Parameters:
  • results (BenchmarkResults) – Benchmark results to analyze.

  • aggregate_by (str, default "explainer") – Dimension to aggregate by.

  • precision (int, default 3) – Number of decimal places.

  • caption, label (str, optional) – LaTeX caption and label.

Return type:

str

Returns:

str – LaTeX table source code.

__init__(metrics, weights=<factory>, directions=<factory>)
Return type:

None

Example:

from tscf_eval.benchmark import WeightedScalarizer

# Equal-weight composite across metrics
scalarizer = WeightedScalarizer(metrics=[
    "validity_soft", "proximity_dtw", "sparsity",
])
scores = scalarizer.score(results)

# Custom weights emphasizing validity
scalarizer = WeightedScalarizer(
    metrics=["validity_soft", "proximity_dtw", "sparsity"],
    weights={"validity_soft": 3.0, "proximity_dtw": 1.0, "sparsity": 1.0},
)

# Sensitivity analysis
sens_df = scalarizer.sensitivity(results, vary_metric="validity_soft", n_steps=11)
scalarizer.plot_sensitivity(sens_df)

Statistical Testing

friedman_test

tscf_eval.benchmark.friedman_test(results, metric, aggregate_by='explainer', group_by='dataset')[source]

Run a Friedman test comparing explainers across groups.

The Friedman test is a non-parametric test for detecting differences in treatments across multiple groups (e.g., explainers across datasets).

Parameters:
  • results (BenchmarkResults) – Benchmark results to analyze.

  • metric (str) – Metric name to test.

  • aggregate_by (str, default "explainer") – Treatments to compare (columns of the rank matrix).

  • group_by (str, default "dataset") – Blocking factor (rows of the rank matrix).

Return type:

FriedmanResult

Returns:

FriedmanResult – Named tuple with statistic, p_value, and rankings.

Raises:
  • ImportError – If scipy is not installed.

  • ValueError – If there are fewer than 3 treatments or fewer than 2 groups.
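
To show what the rank matrix and the test statistic look like, the sketch below computes mean ranks and the textbook Friedman chi-squared statistic in pure Python on made-up scores. It is a simplified illustration: no tie correction is applied, higher values are assumed to be better, and the p-value (which needs the chi-squared distribution, for example from scipy) is omitted.

```python
def friedman_statistic(table):
    """Friedman statistic and mean ranks for a blocks x treatments table.

    Higher values receive higher ranks; ties share the average rank.
    """
    n_blocks, k = len(table), len(table[0])
    rank_sums = [0.0] * k
    for row in table:
        order = sorted(range(k), key=lambda j: row[j])
        start = 0
        while start < k:
            end = start
            # Extend the run while the next value ties with the current one.
            while end + 1 < k and row[order[end + 1]] == row[order[start]]:
                end += 1
            avg_rank = (start + end) / 2 + 1  # mean of 1-based ranks in the run
            for pos in range(start, end + 1):
                rank_sums[order[pos]] += avg_rank
            start = end + 1
    mean_ranks = [s / n_blocks for s in rank_sums]
    sum_sq = sum(s * s for s in rank_sums)
    stat = 12.0 * sum_sq / (n_blocks * k * (k + 1)) - 3.0 * n_blocks * (k + 1)
    return stat, mean_ranks


# Hypothetical validity scores: rows are datasets, columns are explainers.
table = [
    [0.90, 0.80, 0.70],
    [0.95, 0.85, 0.60],
    [0.88, 0.90, 0.75],
    [0.92, 0.81, 0.70],
]
stat, mean_ranks = friedman_statistic(table)
print(stat, mean_ranks)
```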

Example:

from tscf_eval.benchmark import friedman_test

fr = friedman_test(results, metric="validity_soft")
print(f"Statistic: {fr.statistic:.3f}, p-value: {fr.p_value:.4f}")
print(fr.rankings)

FriedmanResult

A NamedTuple with three fields:

  • statistic (float) – Friedman chi-squared statistic.

  • p_value (float) – p-value of the test.

  • rankings (pd.DataFrame) – Mean rank of each treatment (explainers by default) on the tested metric.

LaTeX Table Generation

tscf_eval.benchmark.format_latex_table(df, directions=None, bold_best=True, arrows=True, precision=3, midrule_every=0, escape_underscores=True, caption=None, label=None)[source]

Format a DataFrame as a LaTeX table with best-value highlighting.

Parameters:
  • df (pd.DataFrame) – DataFrame to format. First column is typically the entity name (explainer/model), remaining columns are metrics.

  • directions (dict[str, bool], optional) – Mapping of column name to True if higher is better. If None, uses the built-in metric direction registry.

  • bold_best (bool, default True) – Bold the best value in each numeric column.

  • arrows (bool, default True) – Append directional arrows to column headers.

  • precision (int, default 3) – Number of decimal places for floats.

  • midrule_every (int, default 0) – Insert \midrule every N data rows. 0 means no midrules.

  • escape_underscores (bool, default True) – Replace _ with \_ in column headers and string cells.

  • caption (str, optional) – LaTeX table caption.

  • label (str, optional) – LaTeX table label for cross-referencing.

Return type:

str

Returns:

str – LaTeX table source code.
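
The core of best-value highlighting can be sketched as below: find the best value per column according to its direction, wrap it in \textbf{}, and escape underscores in names. This is an illustration on hypothetical rows that emits only the table body lines; the function name is made up and the library's actual output layout may differ.

```python
def bold_best_rows(rows, higher_is_better, precision=3):
    """Format rows as LaTeX table lines, bolding the best value per column."""
    metrics = list(higher_is_better)
    best = {
        m: (max if higher_is_better[m] else min)(r[m] for r in rows)
        for m in metrics
    }
    lines = []
    for r in rows:
        cells = [r["name"].replace("_", r"\_")]  # escape underscores
        for m in metrics:
            text = f"{r[m]:.{precision}f}"
            cells.append(rf"\textbf{{{text}}}" if r[m] == best[m] else text)
        lines.append(" & ".join(cells) + r" \\")
    return lines


# Hypothetical aggregated scores.
rows = [
    {"name": "comte_dtw", "validity": 0.95, "proximity_l2": 2.0},
    {"name": "ng_blend", "validity": 0.90, "proximity_l2": 1.25},
]
latex_lines = bold_best_rows(rows, {"validity": True, "proximity_l2": False})
for line in latex_lines:
    print(line)
```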

Parameters:
  • df (DataFrame)

  • directions (dict[str, bool] | None)

  • bold_best (bool)

  • arrows (bool)

  • precision (int)

  • midrule_every (int)

  • escape_underscores (bool)

  • caption (str | None)

  • label (str | None)