**3. Machine learning model design**

Like the data management system, machine learning models designed for precision medicine need to be able to accept new data on an ongoing basis. The data contents may change over time as new discoveries about the illness are made, though it can safely be assumed that new data will be related to old data in some way. The contents of new data also cannot be expected to be well distributed across all target metrics. All of these requirements make precision medicine systems a natural use-case for continual learning.

Continual learning systems are characterized by their iterative training, as well as the ability to 'recall' what they learn from prior tasks to help solve new ones. Each of these tasks is assumed to be related to the others, but to contain non-trivial variation. This means the model must be flexible to change while avoiding completely reconstructing itself after each new task, which could result in it 'forgetting' useful prior learning. These capabilities are referred to, respectively, as forward transfer (the ability to leverage prior learning to improve future analyses) and backward transfer (the ability to leverage new knowledge to help with prior tasks).

Promising progress has been made in designing continual learning systems [20], to the point that preliminary frameworks have been devised to develop them. For this chapter, we will be using Fayek et al.'s *Progressive Learning* framework [29] as a baseline reference, though some changes were made to account for precision medicine applications.

### **3.1 Initial network structure**

All networks need to start somewhere; this initial network acts, for all intents and purposes, like a classical static machine learning system. Neural networks are the system of choice for these processes, as they allow multiple data types to be analyzed simultaneously and can be constructed in a modular fashion to match the data storage structure detailed earlier. What this entails will differ depending on the data. For data with implicit relations between features (such as MRI images, with their spatial relations), Convolutional Neural Network (CNN) systems have been shown to be extremely effective [30]. CNNs are also among the most computationally efficient neural networks to train and run [31, 32], making them ideal for low-resource systems. For other data, a Densely Connected Learning Network (DCLN) may be more appropriate. The complexity of these networks can be tuned to fit the problem, though they tend to be over-parameterized, potentially causing them to "stick" in one place or over-fit to training data; this is mitigated somewhat via model pruning, discussed later in this section. The choice of available models is ever-changing, however, so one should find the model structure which best fits their specific case.
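As an illustration of this modular branch design, the sketch below (assuming PyTorch; the module names `ConvBranch` and `DenseBranch`, all layer sizes, and the choice of a small 3D CNN are illustrative, not the chapter's exact implementation) shows one possible CNN branch for spatially structured inputs and one densely connected branch for linear data.

```python
# Hedged sketch of the two branch types discussed above, assuming PyTorch.
import torch
import torch.nn as nn

class ConvBranch(nn.Module):
    """CNN branch for data with implicit spatial relations (e.g. MRI volumes)."""
    def __init__(self, in_channels: int = 1, out_features: int = 64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(8, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),   # collapse the spatial dimensions
        )
        self.project = nn.Linear(16, out_features)

    def forward(self, x):
        return self.project(self.features(x).flatten(1))

class DenseBranch(nn.Module):
    """Densely connected branch for linear/tabular data (a simple 'DCLN')."""
    def __init__(self, in_features: int, out_features: int = 32):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(in_features, 64), nn.ReLU(),
            nn.Linear(64, out_features), nn.ReLU(),
        )

    def forward(self, x):
        return self.layers(x)
```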


For progressively learning models, one further constraint exists: the model must be cleanly divisible into 'blocks' of layers. As discussed in Fayek et al.'s framework [29], this is necessary to allow the model to progress over time. How these blocks are formed can be arbitrary, so long as each block can be generalized to accept data containing the same features but of different shape (as the size of the input data grows due to the concatenation operation discussed later in this section). One should also keep in mind that the block containing the output layer will be reset every progressive iteration, and should therefore be kept as lightweight as possible.

For DCM, this would be accomplished via multiple blocks running in parallel. For MRI inputs, which are 3D spatial sequences, something like the 3D DenseNet employed by Ke et al. [33] could be used. This DenseNet could run alongside DCLN blocks that read and interpret our linear data (demographics, for example), with the two grouped together to form the initial progressive learning model. A diagram of this structure, using the same data organization shown earlier (**Figure 1**) and a simplified "MRI" branch, is shown in **Figure 2**.
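A rough sketch of how such branches could be wired into the Figure 2 structure is given below, again assuming PyTorch; `InitialModel`, the 64-unit merging width, and the four-class output are placeholders rather than the exact architecture used.

```python
# Hedged sketch of the Figure 2 layout: one branch block per data form, a
# merging block over the concatenated branch outputs, and a lightweight
# output block (which will be reset on every progression).
import torch
import torch.nn as nn

class InitialModel(nn.Module):
    def __init__(self, branches: dict, merged_in: int, n_classes: int = 4):
        super().__init__()
        self.branches = nn.ModuleDict(branches)            # one block per form
        self.merge = nn.Sequential(nn.Linear(merged_in, 64), nn.ReLU())
        self.output = nn.Linear(64, n_classes)              # kept lightweight

    def forward(self, inputs: dict):
        # Each form's input goes only to its own branch; results are concatenated.
        outs = [branch(inputs[name]) for name, branch in self.branches.items()]
        return self.output(self.merge(torch.cat(outs, dim=1)))

# Example wiring, assuming branch modules like those sketched earlier:
# model = InitialModel(
#     {"mri": ConvBranch(out_features=64), "demographics": DenseBranch(8, 32)},
#     merged_in=64 + 32, n_classes=4)
```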

For the purposes of comparison with the original *Progressive Learning* framework, however, our testing system will instead use their initial model structure [29].

### **Figure 2.**

*An example of the initial neural network structure for use in precision medicine systems. Note that each form receives its own "branch" block (presented within the model block column), which is used to interpret the form's contents. As a result, each branch's structure can be tailored to suit the form's contents, allowing for modular addition or removal of models feeding into the network as needed. The results of each branch's interpretation are then submitted to a set of "merging" blocks, which attempt to combine these results in a sensible manner, before a final "output" layer reports the model's predictions for the input. The output layer is also modular, allowing for extension and/or revision as desired.*

### **3.2 Iterative training data considerations**

Once this initial framework is in place, it then needs to prove capable of accepting new patient data, updating itself as it does so. Given that the measurements of patients enrolling in clinical illness studies can be sporadic in terms of when and how often they are made, data for this system will need to be collected over time until a sufficiently large 'batch' of new records is acquired. Ideally, batches would be sizable enough to be split into multiple smaller batches, allowing for curriculum formation as detailed in the subsequent section; in many cases this is simply not feasible due to the time required to obtain such large batches. In that circumstance, each batch acts as a single 'curriculum' provided to our network in effectively random order. Thankfully, the curriculum stage appears to be the least significant stage of the Progressive Learning framework [29]. The size of these batches will depend heavily on how much data one expects to be able to collect in a given period of time and how regularly one wishes to update the model. For categorical data, each batch should include at least two of every category (one for testing, one for validation), which may influence how many samples one needs to acquire. We recommend slightly larger batch sizes when linear data is brought into the fold, to account for the increased variety. For our DCM dataset, with a categorical output metric (the mJOA-derived DCM severity class, consisting of four classes), a batch of 20 patient records was selected. Data augmentation of this *new* data can also be utilized to increase the number of effective records being submitted; however, one should avoid augmenting with data from records used in previous training cycles, as this can lead to the model failing to adopt novel data trends in newer results.
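The batch-readiness rule described above could be checked with something like the following sketch; `batch_is_ready` and its thresholds are hypothetical names and values chosen to mirror our 20-record, four-class setup.

```python
# Illustrative sketch of the batch-accumulation rule: a batch is "ready" once it
# reaches the target size and every class has at least two samples (one reserved
# for testing, one for validation).
from collections import Counter

def batch_is_ready(labels, batch_size=20, n_classes=4, min_per_class=2):
    """Return True when the accumulated records can form a training batch."""
    if len(labels) < batch_size:
        return False
    counts = Counter(labels)
    return all(counts.get(c, 0) >= min_per_class for c in range(n_classes))

# Example: labels of the records collected so far (labels only, for brevity).
collected = [0, 1, 1, 2, 3, 0, 2, 3, 1, 0, 2, 3, 1, 0, 2, 3, 0, 1, 2, 3]
print(batch_is_ready(collected))  # True: 20 records, at least 2 of each class
```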

### **3.3 Continual learning**

Here, we will focus on detailing a framework based on Fayek et al.'s Progressive Learning framework [29], which consists of three stages: *curriculum*, *progression* and *pruning*.

### *3.3.1 Curriculum*

If sufficiently large batches of new data can be collected in a timely manner, one can utilize the curriculum stage; at least three times the number of records per batch collected within a 6-month period seems to be a good cutoff for this, though this can differ depending on how rapidly one expects disease trends to change. This stage, as described in Fayek et al.'s framework [29], is composed of two steps: curricula creation and task ordering. In the creation step, the batch is split into sub-batches, each known as a 'curriculum'. How this is done depends on the data at hand (e.g. categorical data requires that each curriculum contains data from each category), but can otherwise be performed arbitrarily. Once these curricula are formed, they are sorted based on an estimate of how "difficult" they are, from least to most. Difficulty estimation can be as simple as running a regression on the data and using the resulting loss metric. The sorted set of curricula is then submitted to the network for the progression and pruning stages, one at a time. This allows the network to learn incrementally, picking up the "easier" trends from earlier curricula before being tasked with learning more "difficult" trends in later ones.
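One possible implementation of curricula creation and difficulty-based ordering is sketched below; the use of scikit-learn's logistic regression as a difficulty proxy, and the helper names `make_curricula` and `difficulty`, are illustrative assumptions rather than the framework's prescribed method.

```python
# Hedged sketch of the curriculum stage: split a large batch into sub-batches
# (each containing every class), estimate each sub-batch's "difficulty" with a
# simple surrogate model's loss, then order curricula from easiest to hardest.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def make_curricula(X, y, n_curricula):
    """Stratified split into sub-batches so each curriculum sees every class."""
    order = np.argsort(y)                        # group records by class
    folds = [order[i::n_curricula] for i in range(n_curricula)]
    return [(X[idx], y[idx]) for idx in folds]

def difficulty(X, y):
    """Loss of a simple surrogate fit on the sub-batch: higher means harder."""
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    return log_loss(y, clf.predict_proba(X), labels=clf.classes_)

# Example with synthetic data: 60 records, 4 classes, 3 curricula,
# sorted from "easiest" to "hardest" before being fed to the network.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))
y = rng.integers(0, 4, size=60)
curricula = sorted(make_curricula(X, y, 3), key=lambda c: difficulty(*c))
```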

In precision medicine, however, collecting sufficient data in a useful time span is often not possible. In this case, this stage can be safely skipped; the smaller batches will simply act as randomly sampled, unordered curricula. How "large" is large enough depends on the batch size selected earlier; one should collect at least three batches' worth to warrant the additional complexity of the curriculum stage. For our DCM setup, we fell below this level, as we intend to update as often as possible and as such will utilize new batches as soon as they are collected. We believe that in most precision medicine examples this is likely to be the case, though in some situations (such as the ongoing COVID-19 pandemic), the scale of patient data collection may make the curriculum stage worth considering. Employing few-shot learning techniques may also allow smaller subsets of data to form multiple batches, though the efficacy of such procedures has yet to be tested in this context.

### *3.3.2 Progression*

In this stage, new blocks of layers are generated and connected to the previously generated blocks in the model. The new input-accepting block is simply stacked adjacent to the prior input blocks, ready to receive input from records in our dataset. Each subsequent new block, however, receives the concatenated outputs of *all* blocks from the prior layer, allowing it to incorporate features learned in previous training cycles. The final block, which contains the output layer, is regenerated entirely, resulting in some lost training progress that is, thankfully, usually quickly recovered as the model begins re-training.

The contents of these added blocks depend on the desired task and the computational resources available. Larger, more complex blocks require more computational resources and are more likely to result in over-fitting, but can enable rapid adaptation of the network and better forward transfer. In the original framework [29], these blocks were simply copies of the original blocks' architecture, reduced to approximately half the parameters. However, one could instead cycle through a set of varying block types, based on how well the model has performed and whether new data trends are expected to have appeared. Block designs could also be changed as the model evolves and new effective architectures are discovered, though how effective this is in practice has yet to be seen.

Once these blocks are added, the network is retrained on the new batch of data, generally with the same training setup used for the original set of blocks. During this retraining, prior blocks' parameters can be frozen, locking in what they learned previously while still allowing them to contribute to the model's overall prediction. This prevents catastrophic forgetting of previously learned tasks, should they need to be recalled, though it usually comes at the cost of reduced overall training effectiveness. However, if one does not expect to need to re-evaluate records which have already been tested, one can deviate from Fayek's original design and instead allow prior blocks to change along with the rest of the model. An example of progression (with two simple DCLN blocks being added) is shown in **Figure 3**.
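To make the progression step concrete, the sketch below (PyTorch, assumed) shows a heavily simplified, fully dense version: a new input block is stacked beside the old ones, a new merging block receives the concatenated outputs of all input blocks, prior blocks are optionally frozen, and the output block is rebuilt. `ProgressiveNet` and its layer widths are placeholders, not the DCM model itself.

```python
# Hedged sketch of one progression step on a simplified, fully dense model.
import torch
import torch.nn as nn

class ProgressiveNet(nn.Module):
    def __init__(self, in_features, hidden, n_classes):
        super().__init__()
        self.in_features, self.n_classes = in_features, n_classes
        self.input_blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(in_features, hidden), nn.ReLU())])
        self.merge_blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())])
        self.output = nn.Linear(hidden, n_classes)   # lightweight output block

    def forward(self, x):
        input_outs = [blk(x) for blk in self.input_blocks]
        merged = []
        for i, blk in enumerate(self.merge_blocks):
            # Merge block i sees the concatenated outputs of the input blocks
            # that existed when it was created (blocks 0..i).
            merged.append(blk(torch.cat(input_outs[:i + 1], dim=1)))
        return self.output(torch.cat(merged, dim=1))

    def progress(self, new_hidden, freeze_prior=True):
        if freeze_prior:
            for p in self.parameters():              # lock in prior learning
                p.requires_grad_(False)
        # New input-accepting block, stacked beside the prior input blocks.
        self.input_blocks.append(
            nn.Sequential(nn.Linear(self.in_features, new_hidden), nn.ReLU()))
        # New merging block receives the concatenation of ALL input-block outputs.
        total_in = sum(blk[0].out_features for blk in self.input_blocks)
        self.merge_blocks.append(
            nn.Sequential(nn.Linear(total_in, new_hidden), nn.ReLU()))
        # The output block is regenerated entirely (and is trainable by default).
        total_merged = sum(blk[0].out_features for blk in self.merge_blocks)
        self.output = nn.Linear(total_merged, self.n_classes)

# Example: one progression step (new blocks at roughly half width) before
# retraining on the next batch.
net = ProgressiveNet(in_features=10, hidden=32, n_classes=4)
net.progress(new_hidden=16)
```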

For our DCM data, this is a fairly straightforward decision. New blocks simply consist of new 3D DenseNet blocks run in parallel with simple DCLN layers, both containing approximately half the parameters of the original block set. The output block is a linear layer fed into a SoftMax function for final categorical prediction. As we do not expect prior records to need to be re-tested, we also allow prior blocks to be updated during each training cycle.

### *3.3.3 Pruning*

In this stage, a portion of the parameters in the new blocks is dropped from the network.

### **Figure 3.**

*An example of the progression stage, building off of the initial model shown in Figure 2. New nodes are contained within the gray boxes, with hashed lines indicating the new connections formed as a result. Note that input connections are specific to each form, only connecting to that form's inputs (in this case, only the demographics input) and not to those in the other branches (such as the MRI branch); this allows shortcomings in a particular model's contributions to be accounted for independently, without an extreme growth in network complexity. Note as well that the merging layer (representing all non-input-receiving blocks) forms connections with all prior block outputs, regardless of which forms have received a new connected block. The entire output block is also regenerated at this stage, providing some learning plasticity at the expense of initial learning.*

Which parts of the model are allowed to be pruned depends on how the prior progression stage was accomplished; if previously trained blocks were frozen, then only newly added elements should be allowed to be pruned, to avoid catastrophic loss of prior training. Otherwise, the entire model can be pruned, just as it was allowed to be trained and updated. The pruning system can also vary in how it determines which parameters are to be pruned, though dropping the parameters with the lowest absolute weights is the most straightforward approach. Parameters can also be grouped, with Fayek et al. choosing to prune greedily layer-by-layer; however, we have found that considering all parameters at once is also effective. The proportion *q* dropped per cycle will depend on the computational resources and time available: smaller increments take longer to run, whereas larger values tend to land further away from the "optimal" state of the network. An example of the pruning stage is shown in **Figure 4**.
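A minimal sketch of one global lowest-absolute-weight pruning pass, using PyTorch's built-in pruning utilities, is shown below; `prune_lowest_weights` and the choice of which modules to target are assumptions for illustration.

```python
# Hedged sketch of a single global magnitude-pruning pass: the lowest-|weight|
# fraction q of parameters is removed across all targeted layers at once,
# rather than greedily layer-by-layer.
import torch.nn as nn
import torch.nn.utils.prune as prune

def prune_lowest_weights(modules, q=0.10):
    """Globally prune the fraction q of weights with the smallest absolute value."""
    targets = [(m, "weight") for m in modules
               if isinstance(m, (nn.Linear, nn.Conv3d))]
    prune.global_unstructured(targets, pruning_method=prune.L1Unstructured, amount=q)

# Example: pass only the newly added blocks' modules if prior blocks were frozen,
# otherwise pass the whole model's modules.
# prune_lowest_weights(list(new_blocks.modules()), q=0.10)
```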

The network, now lacking the pruned parameters, is then retrained for a (much shorter) duration to account for their loss.

### **Figure 4.**

*An example of the pruning stage, building off of the progression network shown in Figure 3. Note that only newly formed connections are targeted for pruning by default, with pre-existing connections left untouched. Nodes themselves can also be lost entirely (shown as nodes with no fill and a dashed outline) should all connections leading into them be removed, which in turn causes all connections leading out of them to be pruned by proxy.*

In Fayek et al.'s example, this process is then repeated with progressively larger *q* proportions until a loss in performance is observed. Alternatively, one can instead repeatedly drop the same proportion of parameters each cycle from the *previously pruned* network. This has the benefit of slightly reducing the time taken per cycle (the same weights do not need to be re-pruned every cycle), while also causing the total pruned proportion of the model to grow more gradually per cycle, improving the odds that the model lands closer to the true optimal size for the system. This pruning scheme has the potential to be much slower, however, should the rare circumstance occur where all of the new parameters are useless, requiring more iterations overall. As such, in time-limited systems, Fayek's approach remains more effective.

This stage also allows, in theory, for dynamic feature removal. Should a model (or a feature within said model) cease to be available, one can simply explicitly prune the parameters associated with that feature, in effect performing a targeted pruning cycle. One would need to re-enable training of previously trained nodes to account for this, however, leading to the possibility of reduced backward transfer. Depending on how significant the to-be-removed features have become in the network, this may need to be done over multiple pruning cycles; this should allow the network to adapt to the change over time, reducing the risk of it getting 'stuck' in a sub-optimal state.
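A targeted removal of one input feature could be sketched as below, again assuming PyTorch; `remove_input_feature` is a hypothetical helper that simply masks the weight columns reading from the dropped feature.

```python
# Hedged sketch of targeted pruning when an input feature becomes unavailable:
# mask out the weight columns of the first layer that reads that feature, so the
# rest of the network adapts around its absence during subsequent retraining.
import torch
import torch.nn.utils.prune as prune

def remove_input_feature(first_layer, feature_idx):
    """Permanently zero all weights reading from one input feature column."""
    mask = torch.ones_like(first_layer.weight)
    mask[:, feature_idx] = 0.0          # cut every connection from that feature
    prune.custom_from_mask(first_layer, name="weight", mask=mask)
```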

For our DCM data, the complexity of the illness and the scope of the data make it extremely unlikely for the worst-case pruning issue to occur. As such, a global 10% lowest-absolute-weight pruning system was selected as our starting point, applied iteratively until a loss of mean accuracy over the 10 post-prune correction epochs is observed.
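The iterative prune-retrain loop we describe could be organized roughly as follows; `retrain_and_score` is a stand-in for one's own training loop returning mean accuracy over the 10 correction epochs, and whether to roll back the final (worsening) prune is left to the user.

```python
# Hedged sketch of the iterative pruning loop: prune a fixed fraction of the
# remaining weights, retrain briefly, and stop once mean accuracy drops.
def iterative_prune(model, modules, prune_fn, retrain_and_score, q=0.10):
    """Prune q of the remaining weights each cycle until mean accuracy drops."""
    best = retrain_and_score(model, epochs=10)       # baseline over 10 epochs
    while True:
        prune_fn(modules, q=q)                       # e.g. prune_lowest_weights
        score = retrain_and_score(model, epochs=10)  # post-prune correction epochs
        if score < best:
            break   # performance loss observed; one may roll back this last prune
        best = score
    return model
```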
