
#### *Application of Deep Learning Methods for Detection and Tracking of Players DOI: http://dx.doi.org/10.5772/intechopen.96308*

*Deep Learning Applications*

*5.2.1 Player detection with Mask R-CNN*

The performance of Mask R-CNN for player detection was tested on the PBD-Handball dataset using the standard ResNet-101-FPN network configuration with parameters pre-trained on the COCO dataset. For the player detection experiment, only the bounding boxes that refer to the "person" class were considered.

To obtain a good balance of high detection rates and low false positive detections, detections with confidence values below a threshold experimentally set to 0.55 were discarded. The detector performance was evaluated in terms of recall, precision, F1 score and inference time per frame (using an NVIDIA 1080 Ti GPU). The results and a comparison with the YOLOv3 detector are shown in **Table 2**. A detection was considered a true positive when the intersection of the detected bounding box and the ground truth box was above 50%.

**Table 2.**
*Results of player detection with Mask R-CNN and YOLOv3.*

| Object detector | Inference time / frame | Recall | Precision | F1 |
| --- | --- | --- | --- | --- |
| YOLOv3 | 0.04 s | 68% | 95% | 79% |
| Mask R-CNN | 0.3 s | 76% | 98% | 85% |

One handball scene with the bounding boxes, class confidence values, and segmentation masks obtained with Mask R-CNN is shown in **Figure 5**.

**Figure 5.**
*Player detections obtained with Mask R-CNN in a handball scene (bounding boxes with segmentation mask and confidence value).*

It can be concluded that the results of both the YOLOv3 and Mask R-CNN detectors are good enough to be used for further analysis of player performance. However, the YOLOv3 detector is much faster, so it can be used not only for offline analysis of recordings but also for real-time detection, at the cost of somewhat reduced recall. The detection results could be improved if more data were used, but the performance depends on the number and size of the players in the scene, the contrast between a player and the background, illumination, etc.

**6. Object tracking**

Tracking of handball players in video is an example of a Multi-Object Tracking (MOT) problem, where the goal is to track both the position and the identity of multiple objects present in the video, so that the same unique ID is assigned to each object in every frame in which it appears. Ideally, in every video frame, all players present should be detected in their correct positions, and each player should be assigned a unique ID that stays the same throughout the video. This is a difficult task because many players can be on the field, from 14 to 25 depending on whether it is a practice or a match, and every one of them needs to be tracked. Furthermore, players can leave and re-enter the camera's field of view, move very quickly, change direction often, occlude each other, and wear similar clothes or clothes with a color similar to the background [17].

Thanks to improvements in the performance of object detectors, and thanks to the ability to deal with challenges such as cluttered scenes or the dynamics of tracked objects, tracking-by-detection has become a leading paradigm for MOT.

When using tracking-by-detection, the tracking algorithm relies on the object detector to detect and locate the objects in the scene in each frame, while the role of the tracking algorithm itself is reduced to associating the detections that belong to the same object across frames. To do so, the tracker may use information about the bounding boxes obtained by object detection, such as their dimensions, the locations of their centroids, and their position relative to the boxes in previous frames, or some visual features extracted from the image.

## **6.1 Hungarian assignment algorithm**

The Hungarian algorithm [18] finds the globally optimal assignment of IDs to detected player bounding boxes with respect to a cost function defined for each individual assignment. Here, the cost function is defined only in terms of the parameters of the bounding boxes detected by the object detector in the current and previous frames, without using any visual features extracted from the video frames. Its value depends on the Euclidean distance between each detected bounding box's centroid and the predicted centroid of an object in the track, and on the difference in size between the detected bounding box and the last bounding box assigned to the same track.

Formally, the assignment cost $d(b, k)$ of assigning a bounding box $b$ with centroid $C_b$ and area $P_b$ to the $k$-th track with predicted centroid $C'_{k+1}$ is:

$$d\left(b,k\right) = w \cdot d_2\left(C_b, C'_{k+1}\right) + \left(1 - w\right)\left|P_b - P_{k-1}\right|\tag{1}$$

$$w \in \left[0, 1\right], \quad k \in \mathbb{N}$$

where $d_2$ denotes the Euclidean distance and $P_{k-1}$ is the area of the last bounding box assigned to the track with ID $k$.

For the prediction of the centroid location $C'_{k+1}$, the last known position of the object centroid in track $k$ is used, so $C'_{k+1} = C_{k-1}$.
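To make the cost in Eq. (1) and the assignment step concrete, the following is a minimal sketch rather than the authors' implementation. It assumes each detection and each track is represented as a hypothetical `((x, y), area)` tuple, where a track stores its last known centroid (used as the prediction) and the area of its last assigned box. The weight `w = 0.5` and the sample coordinates are illustrative assumptions. The `hungarian` function is a standard potentials-based formulation of the Hungarian method written with the standard library only; in practice, a library routine such as SciPy's `scipy.optimize.linear_sum_assignment` solves the same assignment problem.

```python
import math


def assignment_cost(det, track, w=0.5):
    """Eq. (1): w * centroid distance + (1 - w) * |area difference|."""
    (dx, dy), d_area = det
    (tx, ty), t_area = track
    return w * math.hypot(dx - tx, dy - ty) + (1.0 - w) * abs(d_area - t_area)


def hungarian(cost):
    """Minimum-cost assignment for a matrix with rows <= columns.

    Returns (row, col) pairs; every row is matched to a distinct column.
    Internally uses 1-based indices, with index 0 as a sentinel.
    """
    n, m = len(cost), len(cost[0])
    u = [0.0] * (n + 1)   # row potentials
    v = [0.0] * (m + 1)   # column potentials
    p = [0] * (m + 1)     # p[j]: row currently matched to column j (0 = none)
    way = [0] * (m + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [math.inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], math.inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0 != 0:        # flip the matching along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return [(p[j] - 1, j - 1) for j in range(1, m + 1) if p[j] != 0]


def assign_detections(detections, tracks, w=0.5):
    """Return the optimal (detection_idx, track_idx) pairs under Eq. (1)."""
    if not detections or not tracks:
        return []
    cost = [[assignment_cost(d, t, w) for t in tracks] for d in detections]
    if len(detections) <= len(tracks):
        return sorted(hungarian(cost))
    # More detections than tracks: solve the transposed problem and swap back.
    transposed = [list(col) for col in zip(*cost)]
    return sorted((d, t) for t, d in hungarian(transposed))


# Illustrative data (assumed values): two tracks, three detections.
tracks = [((100.0, 200.0), 5000.0), ((400.0, 220.0), 4800.0)]
detections = [((405.0, 223.0), 4790.0), ((103.0, 204.0), 5010.0),
              ((250.0, 90.0), 3000.0)]
pairs = assign_detections(detections, tracks)
# pairs == [(0, 1), (1, 0)]; detection 2 is left unassigned.
```

In this example, detection 0 is matched to track 1 and detection 1 to track 0, while the third detection remains unassigned because there are more detections than tracks. Note that the two terms of Eq. (1) have different units (pixels and pixels squared), so the choice of `w` strongly affects which term dominates.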

During the initial assignment of bounding boxes to tracks, a unique track ID is assigned to each detected bounding box whose detector confidence is higher than a set threshold. Afterwards, whenever the number of detected objects exceeds the number of currently active tracks, new tracks are created and initialized using the unassigned objects' bounding boxes.

An existing track is considered inactive when no detections are assigned to that track for several frames. Once a track is marked inactive, no further detections are added to it, so if an object later reappears or is detected again, it will get a new track ID and will be considered a new object.
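The track lifecycle described above (creating tracks from unassigned detections and retiring tracks that receive no detections) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the `Track` and `TrackManager` names are hypothetical, detections are assumed to be `((x, y), area)` tuples already filtered by detector confidence, and `max_missed` is an assumed stand-in for the unspecified "several frames". The assignment pairs are passed in explicitly, so any assignment routine can supply them.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Track:
    track_id: int
    centroid: tuple
    area: float
    missed: int = 0      # consecutive frames without an assigned detection
    active: bool = True


class TrackManager:
    def __init__(self, max_missed=5):
        self.max_missed = max_missed  # assumed value for "several frames"
        self.tracks = []
        self._ids = count(1)          # IDs are never reused

    def active_tracks(self):
        return [t for t in self.tracks if t.active]

    def update(self, detections, pairs):
        """Apply one frame of results.

        detections: list of ((x, y), area) tuples.
        pairs: (detection_idx, track_idx) tuples, where track_idx indexes
               the list returned by active_tracks() before this call.
        """
        active = self.active_tracks()
        matched_dets, matched_tracks = set(), set()
        for d, t in pairs:
            centroid, area = detections[d]
            active[t].centroid, active[t].area = centroid, area
            active[t].missed = 0
            matched_dets.add(d)
            matched_tracks.add(t)
        # Tracks without a detection accumulate misses and may be retired.
        for i, track in enumerate(active):
            if i not in matched_tracks:
                track.missed += 1
                if track.missed >= self.max_missed:
                    track.active = False
        # Unassigned detections start new tracks with fresh IDs.
        for d, (centroid, area) in enumerate(detections):
            if d not in matched_dets:
                self.tracks.append(Track(next(self._ids), centroid, area))


manager = TrackManager(max_missed=2)
manager.update([((10, 10), 100.0), ((50, 50), 120.0)], [])  # tracks 1 and 2
manager.update([((12, 11), 100.0)], [(0, 0)])               # track 2 missed once
manager.update([((13, 12), 100.0)], [(0, 0)])               # track 2 retired
manager.update([((14, 12), 100.0), ((52, 50), 120.0)], [(0, 0)])
ids = [t.track_id for t in manager.tracks]
```

After the last call, the player that reappears near the old position of track 2 receives the new ID 3, which mirrors the behaviour described in the text: an inactive track is never resumed, so a reappearing object is treated as a new one.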
