**6.2 Deep SORT**

Deep SORT [19] is a tracking algorithm that builds upon the Hungarian algorithm by also taking the appearance of the tracked objects into account when associating new detections with previously tracked objects. The appearance information is particularly useful for re-identifying players that were occluded or have temporarily left the scene. As in the previous case, a unique track ID is assigned to each bounding box within the first frame, and the Hungarian algorithm is used to assign the new detections to existing tracks so that the assignment cost function reaches the global minimum.

The cost function consists of the spatial distance $d^{(1)}$, in the form of the Mahalanobis distance of the detected bounding box from its position predicted according to its last known position, and a visual distance $d^{(2)}$ that compares the appearance of the detected object with the history of appearances of the tracked object. Formally, the cost function $c_{i,j}$ of assigning a detected object *j* to a track *i* is given by:

$$c_{i,j} = \lambda\, d^{(1)}(i,j) + (1-\lambda)\, d^{(2)}(i,j), \tag{2}$$

where λ is a tunable parameter that determines the relative influence of the spatial distance $d^{(1)}$ and the visual distance $d^{(2)}$.

The spatial distance $d^{(1)}$ is given by the expression:

$$d^{(1)}(i,j) = (d_j - y_i)^{T} S_i^{-1} (d_j - y_i), \tag{3}$$

where $y_i$ and $S_i$ represent the mean and the covariance matrix of bounding box observations for the *i*th track, and $d_j$ is the *j*th detected bounding box.

The visual distance $d^{(2)}$ is given by the expression:

$$d^{(2)}(i,j) = \min\left\{ 1 - r_j^{T} r_k^{(i)} \;\middle|\; r_k^{(i)} \in \mathcal{R}_i \right\}, \tag{4}$$

where $r_j$ is the appearance descriptor extracted from within the *j*th detected bounding box, and $\mathcal{R}_i$ is the set of the last 100 appearance descriptors $r_k^{(i)}$ associated with the *i*th track.

The $d^{(2)}$ measure uses the cosine distance between the *j*th detection and a number of detections already assigned to the *i*th track, so if a visually similar detection was previously seen, the distance will be low.

The appearance descriptors are extracted using a wide residual neural network comprising two convolutional layers followed by six residual blocks that output a 128-element vector, and then normalized to fit within a unit hypersphere so that the cosine distance can be used. The network was pre-trained on a person re-identification dataset of more than a million images of 1261 pedestrians. The appearance information helps with re-identification of objects that have not been tracked for some time because of missed detections, because they were occluded, or because they have briefly left the scene.

New tracks are formed whenever there are more detections in a frame than there are existing tracks, or when a detection cannot be assigned to any track because its spatial or visual distance from every existing track is too large. The maximum $d^{(1)}$ and $d^{(2)}$ distances at which an assignment is still possible are set with tunable thresholds.

**6.3 The player tracking experiment**

Players were first detected with the YOLOv3 detector, using pre-trained tiny-yolo weights and a confidence threshold of 0.5, and then tracked on the PT-Handball dataset with both the Hungarian algorithm and Deep SORT [20].
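To make the association step used in this experiment concrete, the following is a minimal sketch of how the cost of Eqs. (2) to (4) can be turned into track assignments for one frame. It is not the implementation used in the experiment. Several things are assumptions made for brevity: boxes are reduced to 2-D centre points, the covariance is taken to be diagonal, descriptors have two elements instead of 128, the threshold values and the dictionary keys (`box`, `desc`, `mean`, `var`, `gallery`) are hypothetical, and an exhaustive search over permutations stands in for the Hungarian algorithm (it finds the same minimum, but is only practical for a handful of objects).

```python
from itertools import permutations
import math

GATED = 1e5  # placeholder cost for pairs that violate a threshold


def l2_normalize(v):
    """Scale a descriptor to unit length, so 1 - dot product is the cosine distance."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def spatial_distance(det_box, track_mean, track_var):
    """Eq. (3), simplified: ASSUMES a diagonal covariance (one variance per coordinate)."""
    return sum((d - y) ** 2 / s for d, y, s in zip(det_box, track_mean, track_var))


def visual_distance(det_desc, gallery):
    """Eq. (4): smallest cosine distance to any descriptor stored for the track."""
    return min(1.0 - sum(a * b for a, b in zip(det_desc, r_k)) for r_k in gallery)


def associate(tracks, detections, lam=0.5, max_d1=9.49, max_d2=0.2):
    """Return (matches, unmatched_detections) for one frame."""
    n, m = len(tracks), len(detections)
    cost = [[GATED] * m for _ in range(n)]
    for i, trk in enumerate(tracks):
        for j, det in enumerate(detections):
            d1 = spatial_distance(det["box"], trk["mean"], trk["var"])
            d2 = visual_distance(det["desc"], trk["gallery"])
            if d1 <= max_d1 and d2 <= max_d2:  # both thresholds must hold
                cost[i][j] = lam * d1 + (1.0 - lam) * d2  # Eq. (2)

    # Exhaustive search over one-to-one assignments (stand-in for the Hungarian algorithm).
    if n <= m:
        candidates = [list(zip(range(n), cols)) for cols in permutations(range(m), n)]
    else:
        candidates = [list(zip(rows, range(m))) for rows in permutations(range(n), m)]
    best = min(candidates, key=lambda pairs: sum(cost[i][j] for i, j in pairs))

    # Pairs that were only possible at the placeholder cost are not real matches.
    matches = [(i, j) for i, j in best if cost[i][j] < GATED]
    matched = {j for _, j in matches}
    unmatched = [j for j in range(m) if j not in matched]  # these would start new tracks
    return matches, unmatched


# Hypothetical toy frame: two existing tracks, three detections.
tracks = [
    {"mean": (100.0, 100.0), "var": (25.0, 25.0), "gallery": [l2_normalize([1.0, 0.0])]},
    {"mean": (300.0, 120.0), "var": (25.0, 25.0), "gallery": [l2_normalize([0.0, 1.0])]},
]
detections = [
    {"box": (302.0, 118.0), "desc": l2_normalize([0.1, 1.0])},
    {"box": (103.0, 101.0), "desc": l2_normalize([1.0, 0.1])},
    {"box": (500.0, 400.0), "desc": l2_normalize([1.0, 1.0])},
]
matches, new_track_candidates = associate(tracks, detections)
print(matches, new_track_candidates)  # [(0, 1), (1, 0)] [2]
```

In this toy frame the third detection is far from both tracks in position, so it exceeds the spatial threshold for every track and would be used to start a new track. In a real pipeline, a library routine for the linear assignment problem would replace the permutation search.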

An example of a tracking situation in sequential frames in which occlusions occurred is shown in **Figure 6**. The numbers above the bounding boxes represent the tracking ID of each player. The situation shown is quite demanding and results in rather unstable and inconsistent tracking in the selected frames. The Hungarian algorithm successfully tracked one of the four players; two of the players received new IDs after the occlusion, while one player switched IDs with another, so that player 807 received the previously existing ID 812 (**Figure 6**, top row). Deep SORT correctly tracked all four players (**Figure 6**, bottom row).

Since the best results were obtained with the Deep SORT algorithm, its ability to assign the correct IDs to detections is analyzed in more detail using the common MOT evaluation measures. The results are shown in **Table 3**.

For each player that should be tracked, identity switches caused 5 to 6 additional tracks on average, so there are 5 times more tracks than in the ground-truth annotated data. Furthermore, a large number of identity switches, 1483 in total, is present because the video contains a relatively large number of players who move fast, leave the camera's field of view, frequently change positions, and occlude each other.
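As an illustration of how such counts arise, the sketch below counts identity switches and the number of distinct tracks per player from a per-frame mapping of ground-truth players to tracker IDs. It assumes that the matching between ground-truth boxes and tracker output has already been done for every frame, and it uses the common convention that a switch is counted whenever a player is matched to a different tracker ID than at that player's previous match. The function names and the toy data are hypothetical and are not taken from the experiment.

```python
def count_identity_switches(frames):
    """frames: list of {ground_truth_id: tracker_id} dicts, one per frame."""
    last_seen = {}  # most recent tracker ID matched to each ground-truth player
    switches = 0
    for frame in frames:
        for gt_id, trk_id in frame.items():
            if gt_id in last_seen and last_seen[gt_id] != trk_id:
                switches += 1
            last_seen[gt_id] = trk_id
    return switches


def tracks_per_player(frames):
    """Number of distinct tracker IDs that each ground-truth player received."""
    ids = {}
    for frame in frames:
        for gt_id, trk_id in frame.items():
            ids.setdefault(gt_id, set()).add(trk_id)
    return {gt_id: len(trk_ids) for gt_id, trk_ids in ids.items()}


# Hypothetical four-frame sequence with two players, A and B.
# B is missed in the second frame and comes back with a new ID;
# A later takes over an ID that was previously used for B.
frames = [
    {"A": 1, "B": 2},
    {"A": 1},
    {"A": 1, "B": 3},
    {"A": 2, "B": 3},
]
print(count_identity_switches(frames))  # 2
print(tracks_per_player(frames))        # {'A': 2, 'B': 2}
```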

The number of players that are simultaneously present in the frame affects the tracking performance, and according to the IDF1 measure, the players are correctly identified 24.7% of the time.
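For reference, IDF1 is the harmonic mean of identification precision and identification recall, computed from the numbers of identity true positives (IDTP), false positives (IDFP), and false negatives (IDFN). The sketch below only shows the arithmetic; the counts in the example are invented for illustration and are not the values behind Table 3.

```python
def id_scores(idtp, idfp, idfn):
    """Return (IDP, IDR, IDF1) from identity-level detection counts."""
    idp = idtp / (idtp + idfp)                   # identification precision
    idr = idtp / (idtp + idfn)                   # identification recall
    idf1 = 2 * idtp / (2 * idtp + idfp + idfn)   # harmonic mean of IDP and IDR
    return idp, idr, idf1


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
idp, idr, idf1 = id_scores(idtp=30, idfp=10, idfn=20)
print(round(idp, 3), round(idr, 3), round(idf1, 3))  # 0.75 0.6 0.667
```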

Tracking mistakes can be attributed to several factors. As in all tracking-by-detection algorithms, the accuracy of tracking is greatly influenced by the accuracy of the object detector. If a player is inaccurately detected, the tracking will be inaccurate as well. Furthermore, the scale of an object, occlusion, and the similarity between the color of the players' clothes and the background often cause tracking problems. To overcome these problems, multiple camera systems will be investigated in further work [21], which can also allow for a more robust generation of a top-view trajectory [22].

#### **Figure 6.**

*An example of a tracking situation with occlusion. Top: Hungarian algorithm, bottom: Deep SORT. The left and right frames are 1 second apart.*


**Table 3.**

*Performance evaluation of Deep SORT.*

**Figure 7.** *Problem of re-identification after occlusion.*

In **Figure 7**, the top row shows the problem of re-identification after occlusion, and the bottom row shows the problem of an identity switch due to small scale and similar colors.
