**7.1 LSTM**

Unlike CNNs and other feedforward neural networks (FFNNs), recurrent neural networks (RNNs) [23] have connections that feed the activations computed at a previous time step back into the network, so that they influence the output for the current input. These activations from the previous time step can be held potentially indefinitely in the internal state of the network, so the temporal context is not limited to the fixed window that would have to be used as input to an FFNN. This property makes RNNs especially appropriate for modeling sequences, such as text or a sequence of video frames in action recognition.

Long Short-Term Memory (LSTM) [24] is a type of recurrent neural network (RNN) designed to model temporal sequences and their long-range dependencies more accurately than conventional RNNs. In its recurrent hidden layers, the LSTM contains special units called memory blocks. These units contain memory cells, which have self-connections that store the temporal state of the network, and gates, which are special multiplicative units that control the flow of information. The input gate controls the flow of input activations into the memory cell, while the output gate controls the output flow of cell activations into the rest of the network. The forget gate scales the internal state of the cell before adding it back as input through the cell's self-recurrent connection, thus causing the adaptive forgetting or resetting of the cell's memory.
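To make the gating mechanism concrete, the following is a minimal NumPy sketch of a single time step of a standard LSTM memory cell. The weight and bias names, dimensions, and random initialization are illustrative assumptions, not the implementation used in the experiments described below.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One time step of a standard LSTM memory cell.

    x_t    : input at time t, shape (input_dim,)
    h_prev : previous hidden state (cell output), shape (units,)
    c_prev : previous cell state, shape (units,)
    W, U, b: weights/biases for the 'i', 'f', 'o', 'c' paths (illustrative names).
    """
    # Input gate: controls how much of the candidate update enters the cell.
    i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])
    # Forget gate: scales the previous cell state on the self-recurrent connection.
    f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])
    # Output gate: controls how much of the cell activation flows to the rest of the network.
    o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])
    # Candidate cell update.
    g = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])
    # New cell state: adaptively forget old memory and add the gated new input.
    c_t = f * c_prev + i * g
    # New hidden state (cell output).
    h_t = o * np.tanh(c_t)
    return h_t, c_t

# Example with random weights (units=4, input_dim=3), processing a sequence of 10 inputs.
rng = np.random.default_rng(0)
units, input_dim = 4, 3
W = {k: rng.normal(size=(units, input_dim)) for k in 'ifoc'}
U = {k: rng.normal(size=(units, units)) for k in 'ifoc'}
b = {k: np.zeros(units) for k in 'ifoc'}
h, c = np.zeros(units), np.zeros(units)
for x_t in rng.normal(size=(10, input_dim)):
    h, c = lstm_step(x_t, h, c, W, U, b)
```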

**7.2 The action recognition experiment**

In the experiment, handball actions from the PAR-Handball dataset are considered: Throw, Catch, Shot, Jump-shot, Running, Dribbling, Defense, Passing, Doublepass, and Crossing. An example of the jump-shot action is shown in **Figure 8**. The action consists of a sequence of different phases of the jump-shot, from running, take-off, flight, and throw to landing, captured on different video frames.

**Figure 8.**
*Active player collage for the jump-shot action.*

Different actions can take different amounts of time to perform, so the average number of frames in a video depends on the action class it belongs to, as shown in **Figure 9**. Only the Throw and Catch actions are significantly shorter (two or three times) than the other action classes, which last around 60 frames on average. Because these two actions are also parts of more complex ones, like Passing, the model is trained once with all 11 classes and once with 9 classes, excluding Throw and Catch.

**Figure 9.**
*Average number of frames per action from the handball dataset.*

The model selected for action recognition is an LSTM-based network with one LSTM layer with 1,024 units, followed by a dropout layer with 0.5 dropout, one
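The description of the network is cut off at this point. As a rough illustration, the sketch below builds such an LSTM-based classifier in Keras; only the 1,024-unit LSTM layer, the 0.5 dropout rate, and the class counts come from the text, while the sequence length, per-frame feature dimension, classification head, and training configuration are assumptions for the sake of a runnable example.

```python
import tensorflow as tf

NUM_CLASSES = 11     # 11 action classes in the full setting, 9 when Throw and Catch are excluded
SEQ_LEN = 60         # assumed: roughly the average action length in frames
FEATURE_DIM = 2048   # assumed: dimensionality of the per-frame feature vectors

model = tf.keras.Sequential([
    tf.keras.Input(shape=(SEQ_LEN, FEATURE_DIM)),
    # One LSTM layer with 1,024 units, as stated in the text.
    tf.keras.layers.LSTM(1024),
    # Dropout layer with a 0.5 dropout rate, as stated in the text.
    tf.keras.layers.Dropout(0.5),
    # Assumed classification head: a fully connected softmax layer over the action classes.
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

# Assumed training setup: integer class labels with sparse categorical cross-entropy.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```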
