## **3. Artificial intelligence development and use: woos and woes**

Artificial intelligence (AI) is an emerging field broadly defined as a set of technologies capable of incorporating human behavior and intelligence into machines and systems [4]. Given its potential to improve diagnostic efficacy and treatment recommendations, AI is poised to be increasingly implemented in healthcare and clinical practice. However, a better understanding of what AI entails is warranted.

#### **3.1 Machine learning**

A discussion of AI in neurosurgery would be incomplete without a basic understanding of machine learning (ML), a subfield of AI [5]. The accelerating computerization of patient data in healthcare has produced vast quantities of information, commonly referred to as "big data," beyond what can be reasonably digested by traditional methods of statistical analysis [6]. The emergence of ML has not only unlocked new possibilities for extracting and identifying potentially valuable patterns from past data, but has also created a framework for predicting future data trends [7–9]. The predictive potential of ML can only be harnessed when the model is presented with large quantities of annotated data [10]. For instance, in radiographic imaging, ML can treat each computerized picture element, or pixel, as its own unique variable. Thus, when fed large quantities of data, an ML algorithm can *learn* at a degree of complexity (e.g., tracing the contours of fracture lines, parenchymal opacities, etc.) and at a scale beyond natural human capabilities [10].
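The pixels-as-variables idea can be made concrete with a minimal sketch (not from the chapter; the "scans," the diagonal "fracture line," and the nearest-centroid classifier are entirely synthetic and illustrative): each 16 × 16 image is flattened into a 256-dimensional feature vector, and new images are classified by their distance to the per-class average of those vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_image(has_line: bool) -> np.ndarray:
    """Synthetic 16x16 'scan': background noise, optionally a bright diagonal line."""
    img = rng.normal(0.0, 0.1, size=(16, 16))
    if has_line:
        idx = np.arange(16)
        img[idx, idx] += 1.0  # crude stand-in for a fracture line
    return img

# Each pixel becomes one variable: flatten 16x16 images into 256-dim vectors.
X_train = np.stack([make_image(i % 2 == 0).ravel() for i in range(200)])
y_train = np.array([i % 2 == 0 for i in range(200)])

# Nearest-centroid classifier: average the pixel vectors of each class.
centroid_pos = X_train[y_train].mean(axis=0)
centroid_neg = X_train[~y_train].mean(axis=0)

def predict(img: np.ndarray) -> bool:
    """Label an image by which class centroid its pixel vector is closer to."""
    v = img.ravel()
    return np.linalg.norm(v - centroid_pos) < np.linalg.norm(v - centroid_neg)

print(predict(make_image(True)), predict(make_image(False)))  # True False
```

Real radiographic models are, of course, far more complex (deep networks rather than centroid averages), but the underlying representation, every pixel as an input variable, is the same.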

Machine learning subdomains have traditionally been grouped into two large categories: supervised and unsupervised learning. The former uses annotated datasets to train an algorithm to predict outcomes on unseen data; unsupervised learning, by contrast, clusters datasets without using labels, enabling the extraction of previously unknown features that may be useful for categorizing and predicting relevant clinical outputs without human intervention [11]. Nevertheless, many ML models in healthcare have been shown to perform no better than conventional statistical methods [12, 13]. It should be repeatedly emphasized that the field of ML, in addition to being new, still possesses many fundamental weaknesses that limit its immediate widespread applicability.
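The unsupervised side of this distinction can be illustrated with a small self-contained sketch (purely illustrative; the two "patient groups" are synthetic): a bare-bones k-means routine recovers two latent clusters from unlabeled measurements, with no annotations supplied at any point.

```python
import numpy as np

rng = np.random.default_rng(1)

# Unlabeled 1-D data: two latent patient groups (e.g. a lab value), labels unknown.
group_a = rng.normal(2.0, 0.3, size=(50, 1))
group_b = rng.normal(8.0, 0.3, size=(50, 1))
X = np.vstack([group_a, group_b])

def kmeans(X, k=2, iters=20):
    """Minimal k-means: alternate nearest-centroid assignment and centroid update."""
    # Simple deterministic init: the extreme points of the data.
    centroids = np.array([[X.min()], [X.max()]])
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each centroid as the mean of its cluster.
        centroids = np.stack([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

labels, centroids = kmeans(X)
print(sorted(centroids.ravel().tolist()))  # ≈ [2.0, 8.0]
```

A supervised counterpart would instead be handed the group labels up front and trained to reproduce them; here the structure is discovered from the data alone.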

Using diagnostic testing to determine the presence or absence of disease is an essential process in clinical medicine. In these scenarios, test results are often obtained as continuous values, which must be converted and interpreted into dichotomous groups to determine the presence or absence of a disease [14]. A key stage in this process involves defining a cut-off value, or reference value, to differentiate normal from abnormal conditions. The receiver operating characteristic (ROC) curve, the primary tool used for this determination, classifies a patient's disease state as positive or negative based on test outcomes, simultaneously identifying the optimal cut-off value with the best diagnostic performance [14]. The area under the curve (AUC) serves as a single scalar value summarizing the overall performance of a binary classifier [15]. This measure provides an aggregate evaluation of performance across all possible classification thresholds. In essence, the AUC measures the two-dimensional area beneath the ROC curve from the point (0,0) to (1,1). An AUC of 1.0 signifies perfect, error-free classification, whereas an AUC of 0.5, comparable to a random classification method such as a coin toss, holds no diagnostic value. Typically, an AUC exceeding 0.8 is deemed acceptable in non-medical contexts, and an AUC surpassing 0.9 is considered excellent [16].
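As a minimal sketch (not from the chapter), the AUC can be computed directly via its rank-based (Mann–Whitney) formulation: the probability that a randomly chosen positive case scores higher than a randomly chosen negative one, which reproduces the 1.0 (perfect) and 0.5 (coin-toss) reference points described above.

```python
import numpy as np

def roc_auc(y_true, scores):
    """AUC via the rank (Mann-Whitney U) formulation: the probability that a
    random positive case scores higher than a random negative case."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true], scores[~y_true]
    # Compare every positive score against every negative score; ties count 0.5.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

y = [0, 0, 1, 1]
perfect = [0.1, 0.2, 0.8, 0.9]   # every positive outranks every negative
useless = [0.5, 0.5, 0.5, 0.5]   # all ties: no better than a coin toss
print(roc_auc(y, perfect))  # 1.0
print(roc_auc(y, useless))  # 0.5
```

This is exactly the area beneath the ROC curve: sweeping the cut-off from high to low traces the curve, and the ranking probability equals the trapezoidal area under it.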

Nonetheless, it is crucial to underscore that strong performance, as indicated by AUC values greater than 0.80, does not necessarily guarantee a robust model. If machine learning algorithms have not been cross-validated on novel datasets, they risk being overfit to past data, compromising their generalizability [14]. Thus, when the model is used to predict outcomes on unseen data, it may, at best, offer only slight gains over traditional statistical analysis [12, 13, 17–19]. Additionally, the robustness of any given ML model depends directly on the quality and quantity of the data it is fed. If biases arising from differences in data collection methodologies are present in a dataset, both the generalizability and the performance of the model are negatively impacted [10]. Furthermore, the AUC is often presented with a 95% confidence interval because values obtained from a sample are not fixed but are subject to statistical error. Finally, the use of real-world data inherently introduces corruptions into the dataset, known as "noise." Random noise in input datasets can confound the ML tasks of classification, clustering, and association analysis, while also increasing model complexity and training time; all of these degrade the performance of the learning algorithm, because noise cannot easily be distinguished from desired inputs unless the data are appropriately pre-processed before being introduced to the model [20, 21]. In other words, despite impressive AUC values, such models may lack reliability when applied to new, unseen data, underscoring the critical importance of rigorous validation processes in the development of diagnostic tools.
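The overfitting risk described above can be demonstrated with a small synthetic experiment (illustrative only; the data and the 1-nearest-neighbour model are not drawn from any cited study): when labels are pure noise, a model that memorizes its training data reports perfect apparent accuracy, yet collapses to chance on a held-out split, which is exactly what validation on unseen data is meant to expose.

```python
import numpy as np

rng = np.random.default_rng(2)

# Features carry NO signal: the labels are pure noise, so no model can
# genuinely predict them.
X = rng.normal(size=(200, 10))
y = rng.integers(0, 2, size=200)

def one_nn_predict(X_train, y_train, X_query):
    """1-nearest-neighbour: effectively memorises the training set."""
    d = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    return y_train[d.argmin(axis=1)]

# Apparent (training) accuracy: each point is its own nearest neighbour.
train_acc = (one_nn_predict(X, y, X) == y).mean()

# Held-out accuracy on a simple 50/50 split: performance falls to chance.
X_tr, X_te, y_tr, y_te = X[:100], X[100:], y[:100], y[100:]
test_acc = (one_nn_predict(X_tr, y_tr, X_te) == y_te).mean()

print(train_acc)           # 1.0 -- the model has 'fit' the noise
print(round(test_acc, 2))  # ≈ 0.5 -- no generalization whatsoever
```

Cross-validation generalizes this idea: rather than a single split, the data are partitioned into several folds and the model is repeatedly evaluated on the fold it has not seen, so memorization of noise cannot masquerade as predictive skill.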
