sleap.nn.evals#

Evaluation utilities for measuring pose estimation accuracy.

To generate metrics, you’ll need two Labels datasets, one with ground truth data and one with predicted data. The video paths in the datasets must match. Load both datasets and call evaluate, like so:

>>> labels_gt = Labels.load_file("path/to/ground/truth.slp")
>>> labels_pr = Labels.load_file("path/to/predictions.slp")
>>> metrics = evaluate(labels_gt, labels_pr)

evaluate returns a dictionary whose keys are strings naming each metric and whose values are either floats or numpy arrays.

A good place to start if you want to understand how well your models are performing is to look at:

  • oks_voc.mAP

  • vis.precision

  • vis.recall

  • dist.p95
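
For example, once metrics has been computed as above, these headline values can be read straight out of the returned dictionary:

>>> print("mAP:", metrics["oks_voc.mAP"])
>>> print("visibility precision:", metrics["vis.precision"])
>>> print("visibility recall:", metrics["vis.recall"])
>>> print("95th percentile distance:", metrics["dist.p95"])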

sleap.nn.evals.compute_dist_metrics(dists_dict: Dict[str, Union[numpy.ndarray, List[sleap.instance.Instance]]]) Dict[str, numpy.ndarray][source]#

Compute the Euclidean distance error at different percentiles.

Parameters

dists_dict – A dictionary of pairwise distance data as returned by compute_dists(), including a "dists" array of shape (n_positive_pairs, n_nodes).

Returns

A dictionary of distance metrics.
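
As a rough illustration of what this computes, here is a minimal sketch (not the library implementation) of percentile-based distance metrics over the "dists" array; the key names mirror those documented under load_metrics below:

import numpy as np

def dist_metrics_sketch(dists_dict: dict) -> dict:
    # dists_dict: as returned by compute_dists(), with a "dists" array of
    # shape (n_positive_pairs, n_nodes); NaNs mark missing nodes.
    dists = np.asarray(dists_dict["dists"], dtype="float64")
    metrics = {"dist.avg": np.nanmean(dists), "dist.dists": dists}
    for p in (50, 75, 90, 95, 99):
        # nanpercentile ignores NaNs from occluded or unpredicted keypoints.
        metrics[f"dist.p{p}"] = np.nanpercentile(dists, p)
    return metrics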

sleap.nn.evals.compute_dists(positive_pairs: List[Tuple[sleap.instance.Instance, sleap.instance.PredictedInstance, Any]]) Dict[str, Union[numpy.ndarray, List[int], List[str]]][source]#

Compute Euclidean distances between matched pairs of instances.

Parameters

positive_pairs – A list of tuples of the form (instance_gt, instance_pr, _) containing the matched pair of instances.

Returns

A dictionary with the following keys:

  • dists: An array of pairwise distances of shape (n_positive_pairs, n_nodes).

  • frame_idxs: A list of frame indices corresponding to the dists.

  • video_paths: A list of video paths corresponding to the dists.
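
A minimal sketch of the core distance computation over the matched pairs, assuming each instance exposes a numpy() method returning its (n_nodes, 2) point array with NaNs for missing nodes; the frame_idxs and video_paths bookkeeping is omitted:

import numpy as np

def compute_dists_sketch(positive_pairs):
    # Stack node-wise Euclidean errors for each matched (gt, pr, _) pair.
    # Assumes each instance's numpy() returns an (n_nodes, 2) array with
    # NaNs for missing points; NaN propagates where either point is missing.
    dists = np.stack([
        np.linalg.norm(gt.numpy() - pr.numpy(), axis=-1)
        for gt, pr, _ in positive_pairs
    ])
    return dists  # shape: (n_positive_pairs, n_nodes)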

sleap.nn.evals.compute_generalized_voc_metrics(positive_pairs: List[Tuple[sleap.instance.Instance, sleap.instance.PredictedInstance, Any]], false_negatives: List[sleap.instance.Instance], match_scores: List[float], match_score_thresholds: numpy.ndarray = array([0.5, 0.55, ..., 0.95]), recall_thresholds: numpy.ndarray = array([0.0, 0.01, ..., 1.0]), name: str = 'gvoc') Dict[str, Any][source]#

Compute VOC metrics given matched pairs of instances.

Parameters
  • positive_pairs – A list of tuples of the form (instance_gt, instance_pr, _) containing the matched pair of instances.

  • false_negatives – A list of unmatched instances.

  • match_scores – The score obtained in the matching procedure for each matched pair (e.g., OKS).

  • match_score_thresholds – Score thresholds above which a match is considered a true positive.

  • recall_thresholds – Recall thresholds at which to evaluate Average Precision.

  • name – Name to use to prefix returned metric keys.

Returns

A dictionary of VOC metrics.
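
To make the procedure concrete, here is a hedged sketch of average precision at a single match-score threshold using VOC-style interpolation over the recall thresholds; this simplifies the library's bookkeeping and the names are illustrative:

import numpy as np

def average_precision_sketch(match_scores, n_false_negatives, score_threshold,
                             recall_thresholds):
    # Sort matches by score descending; those above threshold count as TPs.
    scores = np.sort(np.asarray(match_scores))[::-1]
    tp = scores >= score_threshold
    cum_tp = np.cumsum(tp)
    precision = cum_tp / np.arange(1, len(scores) + 1)
    recall = cum_tp / (len(scores) + n_false_negatives)  # denominator: all GT
    # VOC-style interpolation: best precision at or beyond each recall level.
    interp = [precision[recall >= r].max() if np.any(recall >= r) else 0.0
              for r in recall_thresholds]
    return float(np.mean(interp))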

sleap.nn.evals.compute_instance_area(points: numpy.ndarray) numpy.ndarray[source]#

Compute the area of the bounding box of a set of keypoints.

Parameters

points – A numpy array of coordinates.

Returns

The area of the bounding box of the points.
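
A minimal NaN-aware sketch of the bounding box area computation (illustrative):

import numpy as np

def bbox_area_sketch(points: np.ndarray) -> np.ndarray:
    # points: (n_instances, n_nodes, 2); NaNs mark missing keypoints.
    min_pt = np.nanmin(points, axis=-2)  # per-instance bounding box corners
    max_pt = np.nanmax(points, axis=-2)
    return np.prod(max_pt - min_pt, axis=-1)  # width * height per instance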

sleap.nn.evals.compute_oks(points_gt: numpy.ndarray, points_pr: numpy.ndarray, scale: Optional[float] = None, stddev: float = 0.025, use_cocoeval: bool = True) numpy.ndarray[source]#

Compute the object keypoints similarity between sets of points.

Parameters
  • points_gt – Ground truth instances of shape (n_gt, n_nodes, n_ed), where n_nodes is the number of body parts/keypoint types and n_ed is the number of Euclidean dimensions (typically 2 or 3). Keypoints that are missing/not visible should be represented as NaNs.

  • points_pr – Predicted instances of shape (n_pr, n_nodes, n_ed).

  • scale – Size scaling factor to use when weighing the scores, typically the area of the bounding box of the instance (in pixels). This should be of length n_gt. If a scalar is provided, the same number is used for all ground truth instances. If set to None, the bounding box area of the ground truth instances will be calculated.

  • stddev – The standard deviation associated with the spread in the localization accuracy of each node/keypoint type. This should be of length n_nodes. “Easier” keypoint types will have lower values to reflect the smaller spread expected in localizing them.

  • use_cocoeval – Whether to compute the OKS the way the cocoeval method does. If True, the score is calculated using the cocoeval method (widely used; code available at cocodataset/cocoapi). If False, the score is calculated exactly as described in the paper referenced in the Notes below.

Returns

The object keypoints similarity between every pair of ground truth and predicted instances, given as a numpy array of shape (n_gt, n_pr) in the range [0, 1.0], with 1.0 denoting a perfect match.

Notes

It’s important to set the stddev appropriately when accounting for the difficulty of each keypoint type. For reference, the median value for all keypoint types in COCO is 0.072. The “easiest” keypoint is the left eye, with stddev of 0.025, since it is easy to precisely locate the eyes when labeling. The “hardest” keypoint is the left hip, with stddev of 0.107, since it’s hard to locate the left hip bone without external anatomical features and since it is often occluded by clothing.

The implementation here is based on the descriptions in: Ronchi & Perona. “Benchmarking and Error Diagnosis in Multi-Instance Pose Estimation.” ICCV (2017).
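
As a worked illustration, here is a sketch of the paper-style OKS for a single ground truth/prediction pair. The spread term is written here as 2 * scale * stddev**2; the exact scaling of this term differs between the cocoeval and paper variants, so treat this as illustrative rather than the library code:

import numpy as np

def oks_sketch(points_gt: np.ndarray, points_pr: np.ndarray,
               scale: float, stddev: float = 0.025) -> float:
    # points_gt, points_pr: (n_nodes, 2); NaNs mark invisible keypoints.
    d2 = np.sum((points_gt - points_pr) ** 2, axis=-1)  # squared error per node
    visible_gt = ~np.isnan(points_gt).any(axis=-1)  # only GT-visible nodes count
    ks = np.exp(-d2 / (2 * scale * stddev**2))  # per-keypoint similarity
    ks = np.where(np.isnan(ks), 0.0, ks)  # a missing prediction scores 0
    return float(ks[visible_gt].mean())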

sleap.nn.evals.compute_pck_metrics(dists: numpy.ndarray, thresholds: numpy.ndarray = array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])) Dict[str, numpy.ndarray][source]#

Compute PCK across a range of thresholds.

Parameters
  • dists – An array of pairwise distances of shape (n_positive_pairs, n_nodes).

  • thresholds – A list of distance thresholds in pixels.

Returns

A dictionary of PCK metrics evaluated at each threshold.
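
A minimal sketch of the PCK computation, taking the fraction of valid node distances under each pixel threshold (names illustrative):

import numpy as np

def pck_sketch(dists: np.ndarray,
               thresholds: np.ndarray = np.arange(1.0, 11.0)) -> dict:
    # dists: (n_positive_pairs, n_nodes); NaNs are excluded from the fraction.
    valid = dists[~np.isnan(dists)]
    pcks = np.array([(valid < t).mean() for t in thresholds])
    return {"pcks": pcks, "mPCK": float(pcks.mean())}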

sleap.nn.evals.compute_visibility_conf(positive_pairs: List[Tuple[sleap.instance.Instance, sleap.instance.Instance, Any]]) Dict[str, float][source]#

Compute node visibility metrics.

Parameters

positive_pairs – A list of tuples of the form (instance_gt, instance_pr, _) containing the matched pair of instances.

Returns

A dictionary of visibility metrics, including the confusion matrix.
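
A sketch of the underlying confusion counts, assuming boolean visibility masks derived from the instances' non-NaN points (the key names match those documented under load_metrics below):

import numpy as np

def visibility_confusion_sketch(vis_gt: np.ndarray, vis_pr: np.ndarray) -> dict:
    # vis_gt, vis_pr: boolean arrays marking which nodes are visible (non-NaN).
    tp = int(np.sum(vis_gt & vis_pr))    # visible in both
    fp = int(np.sum(~vis_gt & vis_pr))   # predicted but not labeled
    tn = int(np.sum(~vis_gt & ~vis_pr))  # absent in both
    fn = int(np.sum(vis_gt & ~vis_pr))   # labeled but missed
    return {
        "vis.tp": tp, "vis.fp": fp, "vis.tn": tn, "vis.fn": fn,
        "vis.precision": tp / (tp + fp) if (tp + fp) else np.nan,
        "vis.recall": tp / (tp + fn) if (tp + fn) else np.nan,
    }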

sleap.nn.evals.evaluate(labels_gt: sleap.io.dataset.Labels, labels_pr: sleap.io.dataset.Labels, oks_stddev: float = 0.025, oks_scale: Optional[float] = None, match_threshold: float = 0, user_labels_only: bool = True) Dict[str, Union[float, numpy.ndarray]][source]#

Calculate all metrics from ground truth and predicted labels.

Parameters
  • labels_gt – The Labels dataset object with ground truth labels.

  • labels_pr – The Labels dataset object with predicted labels.

  • oks_stddev – The standard deviation to use for calculating object keypoint similarity; see compute_oks function for details.

  • oks_scale – The scale to use for calculating object keypoint similarity; see compute_oks function for details.

  • match_threshold – The threshold to use on OKS scores when determining which instances match between ground truth and predicted frames.

  • user_labels_only – If False, predicted instances in the ground truth frame may be considered for matching.

Returns

A dictionary mapping string metric names to metric values (floats or numpy arrays).

sleap.nn.evals.evaluate_model(cfg: sleap.nn.config.training_job.TrainingJobConfig, labels_gt: Union[sleap.nn.data.providers.LabelsReader, sleap.io.dataset.Labels], model: sleap.nn.model.Model, save: bool = True, split_name: str = 'test') Tuple[sleap.io.dataset.Labels, Dict[str, Any]][source]#

Evaluate a trained model and save metrics and predictions.

Parameters
  • cfg – The TrainingJobConfig associated with the model.

  • labels_gt – A LabelsReader pipeline generator that reads the ground truth data to evaluate, or a Labels object to be used as ground truth.

  • model – The sleap.nn.model.Model instance to evaluate.

  • save – If True, save the predictions and metrics to the model folder.

  • split_name – String name to append to the saved filenames.

Returns

A tuple of (labels_pr, metrics).

labels_pr will contain the predicted labels.

metrics will contain the evaluated metrics given the predictions, or None if the metrics could not be computed.
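
A hedged usage sketch, assuming cfg, labels_gt, and model have already been constructed or loaded:

>>> labels_pr, metrics = evaluate_model(cfg, labels_gt, model, save=True, split_name="test")
>>> if metrics is not None:
...     print("mAP:", metrics["oks_voc.mAP"])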

sleap.nn.evals.find_frame_pairs(labels_gt: sleap.io.dataset.Labels, labels_pr: sleap.io.dataset.Labels, user_labels_only: bool = True) List[Tuple[sleap.instance.LabeledFrame, sleap.instance.LabeledFrame]][source]#

Find corresponding frames across two sets of labels.

Parameters
  • labels_gt – A sleap.Labels instance with ground truth instances.

  • labels_pr – A sleap.Labels instance with predicted instances.

  • user_labels_only – If False, frames with predicted instances in labels_gt will also be considered for matching.

Returns

A list of pairs of sleap.LabeledFrame objects in the form (frame_gt, frame_pr).
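
A simplified sketch of the pairing logic, keyed on video path and frame index; this assumes each LabeledFrame exposes video.filename and frame_idx as in SLEAP, and omits the user_labels_only filtering:

def find_frame_pairs_sketch(labels_gt, labels_pr):
    # Index predicted frames by (video path, frame index) for direct lookup.
    pr_index = {(lf.video.filename, lf.frame_idx): lf for lf in labels_pr}
    pairs = []
    for lf_gt in labels_gt:
        lf_pr = pr_index.get((lf_gt.video.filename, lf_gt.frame_idx))
        if lf_pr is not None:
            pairs.append((lf_gt, lf_pr))
    return pairs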

sleap.nn.evals.load_metrics(model_path: str, split: str = 'val') Dict[str, Any][source]#

Load metrics for a model.

Parameters
  • model_path – Path to a model folder or metrics file (.npz).

  • split – Name of the split to load the metrics for. Must be "train", "val" or "test" (default: "val"). Ignored if a path to a metrics NPZ file is provided.

Returns

The loaded metrics as a dictionary with the following keys:

  • "vis.tp": Visibility - True Positives

  • "vis.fp": Visibility - False Positives

  • "vis.tn": Visibility - True Negatives

  • "vis.fn": Visibility - False Negatives

  • "vis.precision": Visibility - Precision

  • "vis.recall": Visibility - Recall

  • "dist.avg": Average Distance (ground truth vs prediction)

  • "dist.p50": Distance for 50th percentile

  • "dist.p75": Distance for 75th percentile

  • "dist.p90": Distance for 90th percentile

  • "dist.p95": Distance for 95th percentile

  • "dist.p99": Distance for 99th percentile

  • "dist.dists": All distances

  • "dist.frame_idxs": Frame indices corresponding to "dist.dists"

  • "dist.video_paths": Video paths corresponding to "dist.dists"

  • "pck.mPCK": Mean Percentage of Correct Keypoints (PCK)

  • "oks.mOKS": Mean Object Keypoint Similarity (OKS)

  • "oks_voc.mAP": VOC with OKS scores - mean Average Precision (mAP)

  • "oks_voc.mAR": VOC with OKS scores - mean Average Recall (mAR)

  • "pck_voc.mAP": VOC with PCK scores - mean Average Precision (mAP)

  • "pck_voc.mAR": VOC with PCK scores - mean Average Recall (mAR)

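
For example (the model folder path is a placeholder):

>>> metrics = load_metrics("path/to/model_folder", split="val")
>>> print(metrics["oks_voc.mAP"], metrics["dist.p95"])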

sleap.nn.evals.match_frame_pairs(frame_pairs: List[Tuple[sleap.instance.LabeledFrame, sleap.instance.LabeledFrame]], stddev: float = 0.025, scale: Optional[float] = None, threshold: float = 0, user_labels_only: bool = True) Tuple[List[Tuple[sleap.instance.Instance, sleap.instance.PredictedInstance, float]], List[sleap.instance.Instance]][source]#

Match all ground truth and predicted instances within each pair of frames.

This is a wrapper for match_instances() but operates on lists of frames.

Parameters
  • frame_pairs – A list of pairs of sleap.LabeledFrame objects in the form (frame_gt, frame_pr). These can be obtained with find_frame_pairs().

  • stddev – The expected spread of coordinates for OKS computation.

  • scale – The scale for normalizing the OKS. If not set, the bounding box area will be used.

  • threshold – The minimum OKS between a candidate pair of instances to be considered a match.

  • user_labels_only – If False, predicted instances in the ground truth frame may be considered for matching.

Returns

A tuple of (positive_pairs, false_negatives).

positive_pairs is a list of 3-tuples of the form (instance_gt, instance_pr, oks) containing the matched pair of instances and their OKS.

false_negatives is a list of ground truth sleap.Instance objects that could not be matched.
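
Taken together with the functions above, a simplified sketch of the overall evaluation flow (roughly what evaluate orchestrates internally) looks like:

>>> frame_pairs = find_frame_pairs(labels_gt, labels_pr)
>>> positive_pairs, false_negatives = match_frame_pairs(frame_pairs)
>>> dists_dict = compute_dists(positive_pairs)
>>> dist_metrics = compute_dist_metrics(dists_dict)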

sleap.nn.evals.match_instances(frame_gt: sleap.instance.LabeledFrame, frame_pr: sleap.instance.LabeledFrame, stddev: float = 0.025, scale: Optional[float] = None, threshold: float = 0, user_labels_only: bool = True) Tuple[List[Tuple[sleap.instance.Instance, sleap.instance.PredictedInstance, float]], List[sleap.instance.Instance]][source]#

Match pairs of instances between ground truth and predictions in a frame.

Parameters
  • frame_gt – A sleap.LabeledFrame with ground truth instances.

  • frame_pr – A sleap.LabeledFrame with predicted instances.

  • stddev – The expected spread of coordinates for OKS computation.

  • scale – The scale for normalizing the OKS. If not set, the bounding box area will be used.

  • threshold – The minimum OKS between a candidate pair of instances to be considered a match.

  • user_labels_only – If False, predicted instances in the ground truth frame may be considered for matching.

Returns

A tuple of (positive_pairs, false_negatives).

positive_pairs is a list of 3-tuples of the form (instance_gt, instance_pr, oks) containing the matched pair of instances and their OKS.

false_negatives is a list of ground truth sleap.Instance objects that could not be matched.

Notes

This function uses the approach from the PASCAL VOC scoring procedure. Briefly, predictions are sorted in descending order by their instance-level prediction scores and greedily matched to ground truth instances, which are then removed from the pool of available instances.

Ground truth instances that remain unmatched are considered false negatives.
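
A minimal sketch of that greedy procedure, assuming an OKS matrix of shape (n_gt, n_pr) (e.g., from compute_oks) whose columns are already sorted by descending prediction score; the helper name is illustrative:

import numpy as np

def greedy_match_sketch(oks: np.ndarray, threshold: float = 0.0):
    # oks: (n_gt, n_pr) similarity matrix; columns assumed pre-sorted by
    # descending prediction score, as in the VOC procedure described above.
    available = set(range(oks.shape[0]))
    matches = []
    for j in range(oks.shape[1]):  # best-scoring predictions claim GT first
        if not available:
            break
        best = max(available, key=lambda i: oks[i, j])
        if oks[best, j] > threshold:
            matches.append((best, j, float(oks[best, j])))
            available.remove(best)
    false_negatives = sorted(available)  # unmatched GT indices
    return matches, false_negatives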

sleap.nn.evals.replace_path(video_list: List[dict], new_paths: List[str])[source]#

Replace video paths in unstructured video objects.