**6.2 Deep SORT**

Deep SORT [19] is a tracking algorithm that builds upon the Hungarian algorithm by also taking the appearance of the tracked objects into account when associating new detections with previously tracked objects. The appearance information is particularly useful for re-identifying players that were occluded or have temporarily left the scene. As in the previous case, a unique track ID is assigned to each bounding box in the first frame, and the Hungarian algorithm is used to assign new detections to existing tracks so that the assignment cost function reaches its global minimum.

The cost function consists of the spatial distance $d^{(1)}$, in the form of the Mahalanobis distance of the detected bounding box from the position predicted from its last known position, and a visual distance $d^{(2)}$ that compares the appearance of the detected object with the history of appearances of the tracked object. Formally, the cost $c_{i,j}$ of assigning a detected object $j$ to a track $i$ is given by:

$$c_{i,j} = \lambda\, d^{(1)}(i,j) + (1 - \lambda)\, d^{(2)}(i,j) \tag{2}$$

where $\lambda$ is a tunable parameter that determines the relative influence of the spatial distance $d^{(1)}$ and the visual distance $d^{(2)}$.
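As a minimal sketch of how Eq. (2) feeds the assignment step, the snippet below blends two precomputed distance matrices and solves the assignment with SciPy's `linear_sum_assignment`. The function names `combined_cost` and `assign`, the value `lam=0.5`, and the toy distance values are assumptions made for illustration only; they are not taken from the Deep SORT implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def combined_cost(d1, d2, lam=0.5):
    """Blend spatial (d1) and visual (d2) distances as in Eq. (2).

    Rows index tracks i, columns index detections j (illustrative layout).
    """
    d1 = np.asarray(d1, dtype=float)
    d2 = np.asarray(d2, dtype=float)
    if d1.shape != d2.shape:
        raise ValueError("d1 and d2 must have the same shape")
    return lam * d1 + (1.0 - lam) * d2


def assign(cost):
    """Return (track, detection) index pairs that minimize the total cost."""
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist()))


# Toy example with made-up values: 2 tracks (rows) x 3 detections (columns).
d1 = np.array([[0.5, 9.0, 4.0],
               [8.0, 1.0, 6.0]])
d2 = np.array([[0.1, 0.9, 0.8],
               [0.7, 0.2, 0.9]])

cost = combined_cost(d1, d2, lam=0.5)
matches = assign(cost)  # detection 2 stays unmatched in this toy case
```

With more detections than tracks, as here, the solver leaves the surplus detection unassigned; a tracker would then have to decide separately what to do with it.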

The spatial distance $d^{(1)}$ is given by the expression:

$$d^{(1)}(i,j) = \left(d_j - y_i\right)^T S_i^{-1} \left(d_j - y_i\right) \tag{3}$$

where $y_i$ and $S_i$ represent the mean and the covariance matrix of bounding box observations for the $i$th track, and $d_j$ is the $j$th detected bounding box.
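Eq. (3) as written is the squared form of the Mahalanobis distance (no square root is taken). A minimal NumPy sketch is given below; the function name `mahalanobis_sq`, the four-element box representation (centre x, centre y, width, height), and all numeric values are assumptions chosen for illustration, not the parametrization used by the authors.

```python
import numpy as np


def mahalanobis_sq(detection, mean, cov):
    """Squared Mahalanobis distance (d_j - y_i)^T S_i^{-1} (d_j - y_i)."""
    diff = np.asarray(detection, dtype=float) - np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    # Solve S x = diff rather than forming the explicit inverse of S.
    return float(diff @ np.linalg.solve(cov, diff))


# Hypothetical box state: (centre x, centre y, width, height).
y_i = np.array([100.0, 50.0, 20.0, 40.0])   # track mean (made-up values)
S_i = np.diag([4.0, 4.0, 1.0, 1.0])          # track covariance (made-up values)
d_j = np.array([102.0, 50.0, 21.0, 40.0])    # detected box (made-up values)

dist = mahalanobis_sq(d_j, y_i, S_i)
```

Because the covariance scales each component, the 2-pixel offset in x (variance 4) contributes as much to the distance as the 1-pixel change in width (variance 1).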

The visual distance $d^{(2)}$ is given by the expression:

$$d^{(2)}(i,j) = \min\left\{1 - r_j^T r_k^{(i)} \mid r_k^{(i)} \in \mathcal{R}_i\right\}, \tag{4}$$

where $r_j$ is the appearance descriptor extracted from within the $j$th detected bounding box, and $\mathcal{R}_i$ is the set of the last 100 appearance descriptors $r_k^{(i)}$ associated with the $i$th track.

The $d^{(2)}$ measure is the smallest cosine distance between the $j$th detection and the detections already assigned to the $i$th track, so if a visually similar detection was previously seen, the distance will be low.

The appearance descriptors are extracted using a wide residual neural network comprising two convolutional layers followed by six residual blocks that output a 128-element vector, which is then normalized onto the unit hypersphere so that the cosine distance can be used. The network was pre-trained on a person re-identification dataset of more than a million images of 1261 pedestrians. The appearance information helps with the re-identification of objects that have not been tracked for some time because of missed detections, because they were occluded, or because they briefly left the scene.

New tracks are formed whenever there are more detections in a frame than there are existing tracks, or when a detection cannot be assigned to any track because its spatial or visual distance to every existing track is too large. The maximum $d^{(1)}$ and $d^{(2)}$ distances at which an assignment is still allowed are set with tunable thresholds.

**6.3 The player tracking experiment**

Players were first detected with the YOLOv3 detector, using pre-trained tiny-yolo weights and a confidence threshold of 0.5, and then tracked on the PT-Handball dataset with the Hungarian algorithm and Deep SORT [20].

An example of a tracking situation in sequential frames in which occlusions occurred is shown in **Figure 6**. The numbers above the bounding boxes represent the tracking ID of each player. The situation shown is quite demanding, resulting in rather unstable and inconsistent tracking in the selected frames. The Hungarian algorithm successfully tracked one of the four players; two of the players received new IDs after the occlusion, and one player switched IDs with another, so that player 807 received the previously existing ID 812 (**Figure 6**, top row). Deep SORT managed to track all four players correctly (**Figure 6**, bottom row).

**Figure 6.**

*An example of a tracking situation with occlusion. Top: Hungarian algorithm, bottom: Deep SORT. The left and right frames are 1 second apart.*

Since the best results were obtained with the Deep SORT algorithm, its ability to assign the correct IDs to detections is analyzed in more detail using the common MOT evaluation measures. The results are shown in **Table 3**.

For each player that should be tracked, the identity switches caused 5 to 6 additional tracks on average, so there are 5 times more tracks than in the ground-truth annotations. Furthermore, a large number of identity switches, 1483 in total, are present, because a relatively large number of players in the video move fast, leave the camera's field of view, frequently change positions, and occlude each other.

The number of players that are simultaneously present in the frame affects the tracking performance, and according to the IDF1 measure, the players can be correctly identified 24.7% of the time.

Tracking mistakes can be attributed to several factors. As in all tracking-by-detection algorithms, the accuracy of tracking is greatly influenced by the accuracy of the object detector. If a player is inaccurately detected, the tracking will be inaccurate as well. Furthermore, the scale of an object, occlusion, and the similarity between the color of the players' clothes and the background often cause tracking problems. To overcome these problems, multiple camera systems will be investigated in further work [21], which can also allow for a more robust generation of a top-view trajectory [22].

*Application of Deep Learning Methods for Detection and Tracking of Players*

*DOI: http://dx.doi.org/10.5772/intechopen.96308*
