sleap.nn.evals
Evaluation utilities for measuring pose estimation accuracy.
To generate metrics, you’ll need two Labels datasets, one with ground truth data and one with predicted data. The video paths in the datasets must match.
Load both datasets and call evaluate, like so:
> from sleap import Labels
> from sleap.nn.evals import evaluate
> labels_gt = Labels.load_file("path/to/ground/truth.slp")
> labels_pr = Labels.load_file("path/to/predictions.slp")
> metrics = evaluate(labels_gt, labels_pr)
evaluate returns a dictionary whose keys are strings naming each metric and whose values are either floats or numpy arrays.
A good place to start if you want to understand how well your models are performing is to look at:
oks_voc.mAP
vis.precision
vis.recall
dist.p95
- sleap.nn.evals.compute_dist_metrics(dists: numpy.ndarray) → Dict[str, numpy.ndarray]
Compute the Euclidean distance error at different percentiles.
- Parameters
dists – An array of pairwise distances of shape (n_positive_pairs, n_nodes).
- Returns
A dictionary of distance metrics.
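As a rough illustration of what percentile-based distance metrics look like, the sketch below summarizes a small table of distances in plain Python. It is not the library's implementation: the helper names `percentile` and `summarize_dists`, the choice to drop NaN entries, and the linear-interpolation percentile are assumptions. Only the `dist.*` key names come from this page.

```python
import math

def percentile(sorted_vals, q):
    """Linearly interpolated percentile (q in [0, 100]) of a sorted list."""
    if not sorted_vals:
        return math.nan
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

def summarize_dists(dists):
    """Summarize a (n_positive_pairs, n_nodes) table of distances.

    NaN entries (assumed to mark missing keypoints) are dropped first.
    """
    flat = sorted(d for row in dists for d in row if not math.isnan(d))
    metrics = {"dist.avg": sum(flat) / len(flat) if flat else math.nan}
    for q in (50, 75, 90, 95, 99):
        metrics[f"dist.p{q}"] = percentile(flat, q)
    return metrics

# Hypothetical distances for 2 matched pairs with 3 nodes each.
example_dists = [[1.0, 2.0, math.nan], [3.0, 4.0, 5.0]]
dist_metrics = summarize_dists(example_dists)
```

High percentiles such as `dist.p95` are useful because they expose the tail of large errors that an average can hide.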
- sleap.nn.evals.compute_dists(positive_pairs: List[Tuple[sleap.instance.Instance, sleap.instance.PredictedInstance, Any]]) → numpy.ndarray
Compute Euclidean distances between matched pairs of instances.
- Parameters
positive_pairs – A list of tuples of the form (instance_gt, instance_pr, _) containing the matched pair of instances.
- Returns
An array of pairwise distances of shape (n_positive_pairs, n_nodes).
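The per-node distance computation can be sketched without any SLEAP objects. In the example below, instances are represented as plain lists of (x, y) tuples, which is an assumption made for illustration. The helper name `node_distances` and the NaN-for-missing convention are likewise hypothetical.

```python
import math

def node_distances(points_gt, points_pr):
    """Per-node Euclidean distance between two instances.

    Each argument is a list of (x, y) tuples, one per node. A node that is
    missing in either instance (any NaN coordinate) yields NaN.
    """
    dists = []
    for gt, pr in zip(points_gt, points_pr):
        if any(math.isnan(c) for c in (*gt, *pr)):
            dists.append(math.nan)
        else:
            dists.append(math.dist(gt, pr))
    return dists

# Hypothetical matched pairs: (ground truth points, predicted points).
example_pairs = [
    ([(0.0, 0.0), (10.0, 10.0)], [(3.0, 4.0), (10.0, 12.0)]),
    ([(5.0, 5.0), (math.nan, math.nan)], [(5.0, 6.0), (1.0, 1.0)]),
]
# Rows index matched pairs, columns index nodes.
pair_dists = [node_distances(gt, pr) for gt, pr in example_pairs]
```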
- sleap.nn.evals.compute_generalized_voc_metrics(positive_pairs: List[Tuple[sleap.instance.Instance, sleap.instance.PredictedInstance, Any]], false_negatives: List[sleap.instance.Instance], match_scores: List[float], match_score_thresholds: numpy.ndarray = array([0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]), recall_thresholds: numpy.ndarray = array([0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2, 0.21, 0.22, 0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.29, 0.3, 0.31, 0.32, 0.33, 0.34, 0.35, 0.36, 0.37, 0.38, 0.39, 0.4, 0.41, 0.42, 0.43, 0.44, 0.45, 0.46, 0.47, 0.48, 0.49, 0.5, 0.51, 0.52, 0.53, 0.54, 0.55, 0.56, 0.57, 0.58, 0.59, 0.6, 0.61, 0.62, 0.63, 0.64, 0.65, 0.66, 0.67, 0.68, 0.69, 0.7, 0.71, 0.72, 0.73, 0.74, 0.75, 0.76, 0.77, 0.78, 0.79, 0.8, 0.81, 0.82, 0.83, 0.84, 0.85, 0.86, 0.87, 0.88, 0.89, 0.9, 0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98, 0.99, 1.0]), name: str = 'gvoc') → Dict[str, Any]
Compute VOC metrics given matched pairs of instances.
- Parameters
positive_pairs – A list of tuples of the form (instance_gt, instance_pr, _) containing the matched pair of instances.
false_negatives – A list of unmatched ground truth instances.
match_scores – The score obtained in the matching procedure for each matched pair (e.g., OKS).
match_score_thresholds – Score thresholds at which a match is counted as a true positive.
recall_thresholds – Recall thresholds at which to evaluate Average Precision.
name – Name to use to prefix returned metric keys.
- Returns
A dictionary of VOC metrics.
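To make the roles of `match_score_thresholds` and `recall_thresholds` more concrete, here is a minimal sketch of a VOC-style average precision at a single match score threshold. It is an illustration under stated assumptions, not the library's code: the function name `average_precision` is hypothetical, match scores are assumed to be pre-sorted by descending prediction confidence, and false negatives are passed as a count rather than a list of instances.

```python
from bisect import bisect_left

def average_precision(match_scores, n_false_negatives, match_score_threshold,
                      recall_thresholds):
    """VOC-style AP and final recall at one match score threshold.

    `match_scores` must already be ordered by descending prediction confidence.
    """
    n_gt = len(match_scores) + n_false_negatives
    if n_gt == 0:
        return 0.0, 0.0
    precisions, recalls = [], []
    tp = 0
    for i, score in enumerate(match_scores, start=1):
        tp += score >= match_score_threshold  # pair counts as a true positive
        precisions.append(tp / i)
        recalls.append(tp / n_gt)
    # Make precision monotonically non-increasing (the precision envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sample the envelope at each recall threshold; unreachable recall gives 0.
    sampled = []
    for r in recall_thresholds:
        idx = bisect_left(recalls, r)
        sampled.append(precisions[idx] if idx < len(precisions) else 0.0)
    ap = sum(sampled) / len(sampled)
    recall = recalls[-1] if recalls else 0.0
    return ap, recall

# Hypothetical inputs: 3 matched pairs, 1 unmatched ground truth instance.
ap, recall = average_precision(
    match_scores=[0.9, 0.4, 0.8],
    n_false_negatives=1,
    match_score_threshold=0.5,
    recall_thresholds=[0.0, 0.25, 0.5, 0.75, 1.0],
)
```

A mean Average Precision would then typically be obtained by averaging this value over all match score thresholds.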
- sleap.nn.evals.compute_instance_area(points: numpy.ndarray) → numpy.ndarray
Compute the area of the bounding box of a set of keypoints.
- Parameters
points – A numpy array of coordinates.
- Returns
The area of the bounding box of the points.
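A bounding box area is simply the product of the x and y extents of the visible keypoints. The sketch below shows this for a single instance given as a list of (x, y) tuples. The helper name `bounding_box_area` and the handling of NaN points are assumptions, not details taken from the library.

```python
import math

def bounding_box_area(points):
    """Area of the axis-aligned bounding box of (x, y) points, skipping NaNs."""
    visible = [(x, y) for x, y in points
               if not (math.isnan(x) or math.isnan(y))]
    if not visible:
        return math.nan
    xs, ys = zip(*visible)
    return (max(xs) - min(xs)) * (max(ys) - min(ys))

# Hypothetical instance with one missing keypoint: extents are 3 by 4 pixels.
example_area = bounding_box_area([(1.0, 2.0), (4.0, 6.0), (math.nan, math.nan)])
```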
- sleap.nn.evals.compute_oks(points_gt: numpy.ndarray, points_pr: numpy.ndarray, scale: Optional[float] = None, stddev: float = 0.025) → numpy.ndarray
Compute the object keypoints similarity between sets of points.
- Parameters
points_gt – Ground truth instances of shape (n_gt, n_nodes, n_ed), where n_nodes is the number of body parts/keypoint types, and n_ed is the number of Euclidean dimensions (typically 2 or 3). Keypoints that are missing/not visible should be represented as NaNs.
points_pr – Predicted instances of shape (n_pr, n_nodes, n_ed).
scale – Size scaling factor to use when weighing the scores, typically the area of the bounding box of the instance (in pixels). This should be of the length n_gt. If a scalar is provided, the same number is used for all ground truth instances. If set to None, the bounding box area of the ground truth instances will be calculated.
stddev – The standard deviation associated with the spread in the localization accuracy of each node/keypoint type. This should be of the length n_nodes. “Easier” keypoint types will have lower values to reflect the smaller spread expected in localizing them.
- Returns
The object keypoints similarity between every pair of ground truth and predicted instances, a numpy array of shape (n_gt, n_pr) in the range of [0, 1.0], with 1.0 denoting a perfect match.
Notes
It’s important to set the stddev appropriately when accounting for the difficulty of each keypoint type. For reference, the median value for all keypoint types in COCO is 0.072. The “easiest” keypoint is the left eye, with stddev of 0.025, since it is easy to precisely locate the eyes when labeling. The “hardest” keypoint is the left hip, with stddev of 0.107, since it’s hard to locate the left hip bone without external anatomical features and since it is often occluded by clothing.
The implementation here is based on the descriptions in: Ronchi & Perona. “Benchmarking and Error Diagnosis in Multi-Instance Pose Estimation.” ICCV (2017).
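For intuition, the sketch below computes OKS for a single pair of instances using the standard COCO definition, where each visible keypoint contributes exp(-d² / (2 · scale · k²)) with k = 2 · stddev. This is an assumption-laden illustration: the function name `oks_single` is hypothetical, instances are plain lists of (x, y) tuples, and the exact constants may differ from what this module uses internally.

```python
import math

def oks_single(points_gt, points_pr, scale, stddev=0.025):
    """COCO-style OKS between one ground truth and one predicted instance.

    points_gt / points_pr: lists of (x, y); NaN marks a missing keypoint.
    scale: instance area in pixels (e.g. the bounding box area).
    stddev: a single float or one value per node.
    """
    n_nodes = len(points_gt)
    sigmas = stddev if isinstance(stddev, (list, tuple)) else [stddev] * n_nodes
    total, n_visible = 0.0, 0
    for gt, pr, sigma in zip(points_gt, points_pr, sigmas):
        if any(math.isnan(c) for c in gt):
            continue  # nodes not labeled in the ground truth are ignored
        n_visible += 1
        if any(math.isnan(c) for c in pr):
            continue  # a missed keypoint contributes a similarity of 0
        d2 = sum((g - p) ** 2 for g, p in zip(gt, pr))
        k = 2 * sigma  # COCO per-keypoint constant
        total += math.exp(-d2 / (2 * scale * k ** 2))
    return total / n_visible if n_visible else math.nan

# Hypothetical 2-node instance with an area of 100 square pixels.
gt_points = [(0.0, 0.0), (10.0, 0.0)]
oks_perfect = oks_single(gt_points, gt_points, scale=100.0)
oks_one_missed = oks_single(gt_points, [(0.0, 0.0), (math.nan, math.nan)],
                            scale=100.0)
```

A larger `stddev` makes the exponential decay more slowly, so the same pixel error is penalized less for keypoints that are harder to localize.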
- sleap.nn.evals.compute_pck_metrics(dists: numpy.ndarray, thresholds: numpy.ndarray = array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])) → Dict[str, numpy.ndarray]
Compute PCK across a range of thresholds.
- Parameters
dists – An array of pairwise distances of shape (n_positive_pairs, n_nodes).
thresholds – A list of distance thresholds in pixels.
- Returns
A dictionary of PCK metrics evaluated at each threshold.
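PCK at a threshold is the fraction of keypoints whose distance error falls within that threshold. The following sketch pools all visible keypoints and evaluates a few thresholds. The helper name `pck_at_thresholds`, the use of `<=`, and the pooling across instances are assumptions rather than details confirmed by this page.

```python
import math

def pck_at_thresholds(dists, thresholds):
    """Fraction of visible keypoints within each distance threshold (pixels)."""
    flat = [d for row in dists for d in row if not math.isnan(d)]
    if not flat:
        return {t: math.nan for t in thresholds}
    return {t: sum(d <= t for d in flat) / len(flat) for t in thresholds}

# Hypothetical distances of shape (n_positive_pairs=2, n_nodes=3).
pck_dists = [[0.5, 2.5, math.nan], [4.0, 9.0, 12.0]]
pck = pck_at_thresholds(pck_dists, thresholds=[1.0, 5.0, 10.0])

# A single summary value can be formed by averaging across thresholds.
mean_pck = sum(pck.values()) / len(pck)
```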
- sleap.nn.evals.compute_visibility_conf(positive_pairs: List[Tuple[sleap.instance.Instance, sleap.instance.Instance, Any]]) → Dict[str, float]
Compute node visibility metrics.
- Parameters
positive_pairs – A list of tuples of the form (instance_gt, instance_pr, _) containing the matched pair of instances.
- Returns
A dictionary of visibility metrics, including the confusion matrix.
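Visibility metrics treat each node as a binary classification: was the node present in the ground truth, and was it present in the prediction? The sketch below counts the four outcomes for pairs of instances represented as lists of (x, y) tuples with NaN for missing nodes. That representation and the helper name `visibility_confusion` are assumptions. The `vis.*` keys mirror those listed for `load_metrics` on this page.

```python
import math

def is_visible(point):
    """A point counts as visible when none of its coordinates are NaN."""
    return not any(math.isnan(c) for c in point)

def visibility_confusion(pairs):
    """Count node visibility agreement over (points_gt, points_pr) pairs."""
    tp = fp = tn = fn = 0
    for points_gt, points_pr in pairs:
        for gt, pr in zip(points_gt, points_pr):
            gt_vis, pr_vis = is_visible(gt), is_visible(pr)
            if gt_vis and pr_vis:
                tp += 1
            elif gt_vis:
                fn += 1  # labeled but not predicted
            elif pr_vis:
                fp += 1  # predicted but not labeled
            else:
                tn += 1
    return {
        "vis.tp": tp, "vis.fp": fp, "vis.tn": tn, "vis.fn": fn,
        "vis.precision": tp / (tp + fp) if tp + fp else math.nan,
        "vis.recall": tp / (tp + fn) if tp + fn else math.nan,
    }

nan_pt = (math.nan, math.nan)
# Hypothetical 4-node instance pair covering each of the four outcomes once.
vis = visibility_confusion([
    ([(1.0, 1.0), (2.0, 2.0), nan_pt, nan_pt],
     [(1.0, 1.0), nan_pt, (3.0, 3.0), nan_pt]),
])
```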
- sleap.nn.evals.evaluate(labels_gt: sleap.io.dataset.Labels, labels_pr: sleap.io.dataset.Labels, oks_stddev: float = 0.025, oks_scale: Optional[float] = None, match_threshold: float = 0, user_labels_only: bool = True) → Dict[str, Union[float, numpy.ndarray]]
Calculate all metrics from ground truth and predicted labels.
- Parameters
labels_gt – The Labels dataset object with ground truth labels.
labels_pr – The Labels dataset object with predicted labels.
oks_stddev – The standard deviation to use for calculating object keypoint similarity; see compute_oks function for details.
oks_scale – The scale to use for calculating object keypoint similarity; see compute_oks function for details.
match_threshold – The threshold to use on OKS scores when determining which instances match between ground truth and predicted frames.
user_labels_only – If False, predicted instances in the ground truth frame may be considered for matching.
- Returns
A dictionary whose keys are strings and whose values are metrics (floats or ndarrays).
- sleap.nn.evals.evaluate_model(cfg: sleap.nn.config.training_job.TrainingJobConfig, labels_reader: sleap.nn.data.providers.LabelsReader, model: sleap.nn.model.Model, save: bool = True, split_name: str = 'test') → Tuple[sleap.io.dataset.Labels, Dict[str, Any]]
Evaluate a trained model and save metrics and predictions.
- Parameters
cfg – The TrainingJobConfig associated with the model.
labels_reader – A LabelsReader pipeline generator that reads the ground truth data to evaluate.
model – The sleap.nn.model.Model instance to evaluate.
save – If True, save the predictions and metrics to the model folder.
split_name – String name to append to the saved filenames.
- Returns
A tuple of (labels_pr, metrics).
labels_pr will contain the predicted labels.
metrics will contain the evaluated metrics given the predictions, or None if the metrics failed to be computed.
- sleap.nn.evals.find_frame_pairs(labels_gt: sleap.io.dataset.Labels, labels_pr: sleap.io.dataset.Labels, user_labels_only: bool = True) → List[Tuple[sleap.instance.LabeledFrame, sleap.instance.LabeledFrame]]
Find corresponding frames across two sets of labels.
- Parameters
labels_gt – A sleap.Labels instance with ground truth instances.
labels_pr – A sleap.Labels instance with predicted instances.
user_labels_only – If False, frames with predicted instances in labels_gt will also be considered for matching.
- Returns
A list of pairs of sleap.LabeledFrame objects in the form (frame_gt, frame_pr).
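Conceptually, pairing frames amounts to a lookup on a shared key. The sketch below pairs frames that have the same video path and frame index, using plain dictionaries as stand-ins for labeled frames. The dictionary layout, the `"video"` and `"frame_idx"` keys, and the helper name `pair_frames` are all hypothetical, and the library's own matching logic may differ.

```python
def pair_frames(frames_gt, frames_pr):
    """Pair frames that share the same (video path, frame index) key.

    Each frame is a hypothetical dict with "video" and "frame_idx" entries.
    """
    pr_by_key = {(f["video"], f["frame_idx"]): f for f in frames_pr}
    pairs = []
    for frame_gt in frames_gt:
        frame_pr = pr_by_key.get((frame_gt["video"], frame_gt["frame_idx"]))
        if frame_pr is not None:
            pairs.append((frame_gt, frame_pr))
    return pairs

# Hypothetical frames; only frame 0 of "a.mp4" exists in both sets.
example_gt = [{"video": "a.mp4", "frame_idx": 0},
              {"video": "a.mp4", "frame_idx": 5}]
example_pr = [{"video": "a.mp4", "frame_idx": 0},
              {"video": "b.mp4", "frame_idx": 5}]
frame_pairs = pair_frames(example_gt, example_pr)
```

This also shows why the video paths in the two datasets must match: a frame with a different path never shares a key with its counterpart.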
- sleap.nn.evals.load_metrics(model_path: str, split: str = 'val') → Dict[str, Any]
Load metrics for a model.
- Parameters
model_path – Path to a model folder or metrics file (.npz).
split – Name of the split to load the metrics for. Must be "train", "val" or "test" (default: "val"). Ignored if a path to a metrics NPZ file is provided.
- Returns
The loaded metrics as a dictionary with keys:
"vis.tp": Visibility - True Positives
"vis.fp": Visibility - False Positives
"vis.tn": Visibility - True Negatives
"vis.fn": Visibility - False Negatives
"vis.precision": Visibility - Precision
"vis.recall": Visibility - Recall
"dist.avg": Average Distance (ground truth vs prediction)
"dist.p50": Distance for 50th percentile
"dist.p75": Distance for 75th percentile
"dist.p90": Distance for 90th percentile
"dist.p95": Distance for 95th percentile
"dist.p99": Distance for 99th percentile
"dist.dists": All distances
"pck.mPCK": Mean Percentage of Correct Keypoints (PCK)
"oks.mOKS": Mean Object Keypoint Similarity (OKS)
"oks_voc.mAP": VOC with OKS scores - mean Average Precision (mAP)
"oks_voc.mAR": VOC with OKS scores - mean Average Recall (mAR)
"pck_voc.mAP": VOC with PCK scores - mean Average Precision (mAP)
"pck_voc.mAR": VOC with PCK scores - mean Average Recall (mAR)
- sleap.nn.evals.match_frame_pairs(frame_pairs: List[Tuple[sleap.instance.LabeledFrame, sleap.instance.LabeledFrame]], stddev: float = 0.025, scale: Optional[float] = None, threshold: float = 0, user_labels_only: bool = True) → Tuple[List[Tuple[sleap.instance.Instance, sleap.instance.PredictedInstance, float]], List[sleap.instance.Instance]]
Match all ground truth and predicted instances within each pair of frames.
This is a wrapper for match_instances() but operates on lists of frames.
- Parameters
frame_pairs – A list of pairs of sleap.LabeledFrame objects in the form (frame_gt, frame_pr). These can be obtained with find_frame_pairs().
stddev – The expected spread of coordinates for OKS computation.
scale – The scale for normalizing the OKS. If not set, the bounding box area will be used.
threshold – The minimum OKS between a candidate pair of instances to be considered a match.
user_labels_only – If False, predicted instances in the ground truth frame may be considered for matching.
- Returns
A tuple of (positive_pairs, false_negatives).
positive_pairs is a list of 3-tuples of the form (instance_gt, instance_pr, oks) containing the matched pair of instances and their OKS.
false_negatives is a list of ground truth sleap.Instance objects that could not be matched.
- sleap.nn.evals.match_instances(frame_gt: sleap.instance.LabeledFrame, frame_pr: sleap.instance.LabeledFrame, stddev: float = 0.025, scale: Optional[float] = None, threshold: float = 0, user_labels_only: bool = True) → Tuple[List[Tuple[sleap.instance.Instance, sleap.instance.PredictedInstance, float]], List[sleap.instance.Instance]]
Match pairs of instances between ground truth and predictions in a frame.
- Parameters
frame_gt – A sleap.LabeledFrame with ground truth instances.
frame_pr – A sleap.LabeledFrame with predicted instances.
stddev – The expected spread of coordinates for OKS computation.
scale – The scale for normalizing the OKS. If not set, the bounding box area will be used.
threshold – The minimum OKS between a candidate pair of instances to be considered a match.
user_labels_only – If False, predicted instances in the ground truth frame may be considered for matching.
- Returns
A tuple of (positive_pairs, false_negatives).
positive_pairs is a list of 3-tuples of the form (instance_gt, instance_pr, oks) containing the matched pair of instances and their OKS.
false_negatives is a list of ground truth sleap.Instance objects that could not be matched.
Notes
This function uses the approach from the PASCAL VOC scoring procedure. Briefly, predictions are sorted descending by their instance-level prediction scores and greedily matched to ground truth instances which are then removed from the pool of available instances.
Ground truth instances that remain unmatched are considered false negatives.
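The greedy procedure described in these notes can be sketched on a precomputed OKS matrix. The example below works on indices rather than instance objects. The function name `greedy_match`, the strict `>` comparison against the threshold, and the absence of NaN handling are assumptions made for illustration, not details of the library's implementation.

```python
def greedy_match(oks_matrix, pr_scores, threshold=0.0):
    """Greedily match predictions to ground truth instances.

    oks_matrix[i][j]: OKS between ground truth i and prediction j.
    pr_scores[j]: instance-level confidence of prediction j.
    Returns (positive_pairs, false_negatives) using indices.
    """
    unmatched_gt = set(range(len(oks_matrix)))
    positive_pairs = []
    # Visit predictions from most to least confident.
    order = sorted(range(len(pr_scores)), key=lambda k: -pr_scores[k])
    for j in order:
        if not unmatched_gt:
            break
        # Best still-available ground truth instance for this prediction.
        best_gt = max(sorted(unmatched_gt), key=lambda i: oks_matrix[i][j])
        best_oks = oks_matrix[best_gt][j]
        if best_oks > threshold:
            positive_pairs.append((best_gt, j, best_oks))
            unmatched_gt.remove(best_gt)  # remove from the available pool
    return positive_pairs, sorted(unmatched_gt)

# Hypothetical frame: 3 ground truth instances, 2 predictions.
example_oks = [[0.9, 0.2],
               [0.8, 0.1],
               [0.0, 0.0]]
matched, unmatched = greedy_match(example_oks, pr_scores=[0.5, 0.95])
```

In this example the more confident prediction 1 claims ground truth 0 first, so prediction 0 settles for ground truth 1 even though its OKS with ground truth 0 is higher. That is the expected behavior of a greedy, confidence-ordered match rather than a globally optimal assignment.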