I. Introduction

Lumbar spinal stenosis (LSS) is a degenerative disease of the lumbar spine and is among the most common indications for spine surgery. Radiographically and anatomically, it is characterized by narrowing of the spinal canal (1); it typically involves the L4-L5 and L5-S1 levels and, less often, the L3-L4 level (2). Narrowing may occur in the central spinal canal, in the area under the facet joints, or more laterally, in the neural foramina. On axial views, the features specific to lumbar spinal stenosis are bulging of the intervertebral disk, thickening of the ligamentum flavum, and hypertrophy of the facet joints. On sagittal views, the features are loss of disk height, disk protrusion, and facet-joint osteoarthritis, all of which lead to foraminal stenosis (stenosis affecting the spinal foramen). The clinical symptoms of lumbar spinal stenosis include lower extremity pain, weakness, and low back pain (LBP), and they can reduce quality of life (3). In cases of severe chronic pain, LSS patients may benefit from either instrumented spinal fusion surgery (4) or decompression surgery (5). Reported complications of fusion surgery include higher morbidity, pseudarthrosis, and degeneration of adjacent segments (6). Therefore, a correct and timely diagnosis is extremely important.

Based on clinical symptoms, both surgeons and physicians specify the severity of stenosis and decide on the type of lumbar decompression surgery. However, agreement among radiologists, neurologists, and surgeons on the severity, level, and classification of stenosis may be poor (7-9), and the correlation between clinical symptoms and signs and radiological findings may also be poor (10). These limitations leave a considerable amount of subjective judgment in surgeons' decisions about lumbar decompression surgery and about which levels to decompress (11).
In addition, studies have shown wide variability in lumbar spinal canal dimensions among patients who do not have clinical spinal stenosis (2, 12). Karantanas et al. (2) showed that both somatometric parameters and age correlate significantly with many of the measured indices. Other work has measured the L4 canal diameter in the black population as 15.6 mm for males and 14.1 mm for females (13). Measured LSS parameters also differ with sex and age. Twomey et al. (14) compared two adult age groups in both males and females and showed a significant decline in the anteroposterior diameter of the lumbar spinal canal in both sexes for the older group. Griffith et al. (12) also reported differences between women and men in the cross-sectional area of the lumbar spinal canal.

Computer algorithms are increasingly entering medicine; however, they currently focus mainly on analyzing medical images of the brain, heart, and lungs. Resistance to the use of computer diagnostic systems is declining (15), and doctors increasingly appreciate the potential of computer diagnostic products in medicine. In the field of LSS diagnosis, neural networks for automated MRI grading have proved to be a valuable aid (16, 17). The use of dedicated software does not change the radiologist's usual workflow. One such application is the CoLumbo software (https://columbo.me/), which presents the output of an optional analysis for the user to accept or reject and is not intended to replace the clinician's diagnosis. The output of the software is not intended to be used directly for the final diagnosis, which is the sole responsibility of the clinician. It only provides the findings in a text form suitable for further reporting.
The objective of this study is to evaluate and demonstrate the safety of using CoLumbo to assess findings in the lumbar region and to provide segmentation and measurements. This is done by measuring the accuracy of a radiologist using the software against that of a radiologist not using the software, as well as the accuracy of the artificial intelligence (AI) algorithm itself. The specific objectives of this clinical testing are to demonstrate: (a) that the accuracy of a radiologist using CoLumbo is not worse than the accuracy of a radiologist not using CoLumbo; and (b) the accuracy of the algorithm's assessment.

II. Materials and methods

The prospective study "Clinical Trial with Columbo software" was approved by the Ethical Committee for Clinical Trials of the Ministry of Healthcare of *** under protocol *** of 06.08.2020.

A. Patients Data and Radiologists

The target population was patients referred for L-spine MRI for back and/or leg pain or other spine-related symptoms. To reduce the probability of bias from the selection of specific patients, a prospective multicenter study of consecutive patients was conducted to cover a variety of cases and avoid bias by gender, age, or type of disease. The clinical investigation with the CoLumbo software was organized in three medical centers at different locations in ***, where the software was installed and only three researchers (one per medical center) had access to it. The investigation was carried out over a period of two months, September and October 2020. The number of participants in each center was between 100 and 150, and the total number was 382. All cases were acquired in the centers involved in the clinical study, where the investigators work. There was no need for a separate control group, since the product under trial has no therapeutic effect on patients.
Patients younger than 18 or older than 70 years, pregnant women, and persons with concomitant scoliosis were not included in the clinical study, since the morphology of the spine differs significantly in these groups. Five different certified radiologists in three different centers participated equally in each of the three roles: radiologist using the software, radiologist not using the software, and arbiter. In case of disagreement, a third radiologist (investigator-arbitrator) with access to the tool was used to establish the majority opinion. All of the radiologists had more than 20 years of experience in MRI reading. For every case, the radiologist working with the software and the radiologist working without it were independent, but overall the radiologists rotated between these two roles.

The patient images used in the trial came from four different MRI machines of three models (Aera, Signa HDxt, Verio) from two manufacturers (SIEMENS, GE MEDICAL SYSTEMS). The machine protocols differed, following common practice at the centers where the study was performed, but always included the mandatory axial and sagittal T2 series, including 2D and 3D. The 382 consecutive patients referred for L-spine MRI in the three centers were prospectively analyzed for the presence of central stenosis at all lumbar levels. Across the 382 studies, there were 3 to 5 (4.63 on average) levels with available sagittal and axial images per study; the total number of evaluated levels was 1762.

B. CoLumbo software

The CoLumbo software supports some of the most common pathologies of the spine: disc herniation, bulging, stenosis, spondylolisthesis, and hypo- and hyper-lordosis. It is based on an AI algorithm originally developed by Georgiev et al. (18). CoLumbo version 2.0 is software for the visualization and analysis of MRI images of the lumbar spine.
It is assistive software whose main task is to detect a set of common pathologies using its integrated artificial intelligence. CoLumbo evaluates the characteristics of these pathologies and draws the radiologist's attention to them by marking relevant tissues and measurements in different colors on the images, and it automates part of the report writing. The current version is intended for use only by radiologists in medical institutions (radiologists and spine surgeons in the USA). Convolutional neural networks provide the segmentations. Using standard geometric operations such as drawing tangents, bisecting lines, and projections, CoLumbo derives standard measurements such as distances, areas, and angles, as almost all other AI-based segmentation algorithms do. The supported field strengths are 1.5T and 3T. All brands and models are supported.

A screenshot from the Reports module of the pre-commercial deep-learning-based AI tool CoLumbo is shown in Figure 1. It shows the segmented Dural Sac (light blue), vertebral body (green), intervertebral discs (dark blue), lamina and spinous process (dark purple), ligamentum flavum (brown), herniation (red), nerve roots (light red), aorta (purple), and sacrum (light green). Radiologists use this module to evaluate the presence of central stenosis with the assistance of the software.

Figure 1. A screenshot from the Report module of the CoLumbo software: Dural Sac (light blue), vertebrae (green), intervertebral discs (dark blue), lamina and spinous process (dark purple), ligamentum flavum (brown), herniation (crimson), nerve roots (light red), aorta (purple) and sacrum (light green).

C. Segmentation, measurements and statistical analysis

The 382 consecutive patients referred for L-spine MRI in the three centers were prospectively analyzed for the presence of stenosis at all lumbar levels. The grading is based on assessment of both sagittal and axial images.
The software segments the tissues in both types of images, and the radiologists recorded a binary ‘presence’ or ‘absence’ of stenosis on the images. CoLumbo provides segmentation of the following tissues: (a) vertebra (on axial slice and on sagittal slice around mid-sagittal, 35 mm); (b) part of the disk without the herniation (on axial slice and on sagittal slice around mid-sagittal, 35 mm); (c) part of the disk with the herniation (on axial slice), without the extraforaminal and sequestered part; (d) Dural Sac (on axial slice); (e) ligamentum flavum (on axial slice); (f) nerve roots (on axial slice); (g) aorta and/or iliac artery (on axial slice); (h) sacrum (on sagittal slice). Central stenosis is classified as present when the Dural Sac cross-sectional area is less than 100 mm2 (19).

The performance of radiologists aided by CoLumbo, of radiologists not assisted by CoLumbo, and of the software itself is evaluated using accuracy and level of agreement. Sensitivity, specificity, positive predictive value, and negative predictive value are used to evaluate the software performance. Sensitivity measures the proportion of positives that are correctly identified:

Sensitivity = TP / (TP + FN)   (1)

Specificity is the proportion of negatives that are correctly identified, i.e. the proportion of levels without the condition (unaffected) that are correctly identified as not having it:

Specificity = TN / (TN + FP)   (2)

where TP = number of true positives, TN = number of true negatives, FN = number of false negatives, and FP = number of false positives.

Kappa statistics are used to test interrater reliability. Kappa values of 0–0.20, 0.21–0.40, 0.41–0.60, 0.61–0.80, and 0.81–1.00 correspond to slight, fair, moderate, substantial, and almost perfect agreement, respectively (20, 21). Further, the positive and negative predictive values (PPV and NPV) were calculated as follows:

PPV = TP / (TP + FP),   NPV = TN / (TN + FN)   (3)

III. Results

Of the 1762 lumbar levels, 156 were debatable cases, i.e. the radiologist using the software and the radiologist not using the software disagreed on the presence of central stenosis. In 18 of these cases, the consensus or majority opinion coincided with that of the radiologist not using the software. In 138 cases, it coincided with that of the radiologist using the CoLumbo software. The average accuracy of radiologists for the presence of central spinal stenosis is shown in Figure 2 for selected age groups, gender, and centers.

Figure 2. Comparison of the accuracy of radiologists for the presence of spinal stenosis for selected gender and age groups, as well as medical centers.

The measured sensitivity and specificity of the software were 127/137 (92.70% ± 4.36%) and 1644/1660 (99.04% ± 0.47%), respectively. The average sensitivity and specificity of the software for central stenosis derived from the clinical trial are shown in Figure 3. Further, the PPV and NPV were calculated to be 88.81% ± 5.31% and 99.40% ± 0.42%, respectively.

Figure 3. Comparison of the sensitivity and specificity of the software for the presence of spinal stenosis for selected gender and age groups, as well as medical centers.

The average patient age was 49.52 ± 13.20 years; 53.4% of the patients were female and 46.6% were male.

Examples of detected inaccuracies are shown in Figure 4. Figures 4A and 4B show the algorithm's output for a central stenosis with a Dural Sac cross-sectional area over 100 mm2 and for measurements of a naturally reduced sac at the L4/L5 level, respectively. Figure 4C shows an inaccuracy made by a radiologist not using the software, presumably because the central stenosis is visible axially only at the vertebral level.
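The reported software metrics can be checked directly from the counts. The following is a minimal sketch in Python, assuming the confusion counts implied by the fractions 127/137 and 1644/1660 (TP = 127, FN = 10, TN = 1644, FP = 16) and assuming that each "±" is the half-width of a 95% normal-approximation (Wald) interval. Neither assumption is stated in the text, and the function names are illustrative only. Under these assumptions the sketch reproduces the four point estimates and the half-widths for sensitivity and specificity; the half-widths for PPV and NPV are not reproduced exactly by this approximation, so they may have been obtained with a different method. The same 2x2 table also gives Cohen's kappa between the software and the majority opinion.

```python
import math

Z_95 = 1.96  # two-sided 95% normal quantile (assumed interval type)


def proportion_with_ci(successes, total, z=Z_95):
    """Return (estimate, half_width) of a Wald interval for a proportion."""
    p = successes / total
    half_width = z * math.sqrt(p * (1.0 - p) / total)
    return p, half_width


def cohen_kappa(tp, fn, fp, tn):
    """Cohen's kappa for a 2x2 table of test result vs. reference."""
    n = tp + fn + fp + tn
    observed = (tp + tn) / n
    # Chance agreement computed from the marginal totals of both raters.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    return (observed - expected) / (1.0 - expected)


def kappa_band(kappa):
    """Map kappa to the agreement bands quoted in the methods."""
    if kappa < 0:
        return "poor"  # below the bands quoted in the text
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


# Counts inferred from the reported fractions (an assumption, see lead-in).
TP, FN, TN, FP = 127, 10, 1644, 16

sensitivity = proportion_with_ci(TP, TP + FN)   # equation (1)
specificity = proportion_with_ci(TN, TN + FP)   # equation (2)
ppv = proportion_with_ci(TP, TP + FP)           # equation (3)
npv = proportion_with_ci(TN, TN + FN)           # equation (3)
kappa = cohen_kappa(TP, FN, FP, TN)

for name, (p, hw) in [("sensitivity", sensitivity),
                      ("specificity", specificity),
                      ("PPV", ppv), ("NPV", npv)]:
    print(f"{name}: {100 * p:.2f}% +/- {100 * hw:.2f}%")
print(f"kappa (software vs. majority opinion): {kappa:.3f} "
      f"({kappa_band(kappa)})")
```

With these counts the kappa evaluates to about 0.90, which falls in the "almost perfect" band. A Wald interval is the simplest choice; for proportions close to 100%, such as the specificity here, a Wilson or exact interval is often preferred.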
The kappa agreement analysis showed an overall interrater reliability of 92.9%, 89.9%, and 73% for the radiologist using the CoLumbo software, the CoLumbo software alone, and the radiologist not using the software, respectively. The kappa values indicate almost perfect agreement with the majority opinion for both the radiologist using CoLumbo and the software itself (20).

Figure 4. Software performance during measurements and segmentation.

IV. Discussion

Magnetic resonance imaging is the diagnostic gold standard for assessing the degree of lumbar spinal stenosis and its classification. However, MRI reading is time-consuming (22), costly, and prone to errors (23). In this respect, software applications such as CoLumbo would reduce the time needed for MRI reading and reporting without decreasing the accuracy of the final report for some pathologies, while improving it for others. This prospective study demonstrated a successful evaluation of the software performance, showing very good sensitivity, specificity, and positive and negative predictive values. This inter-reader study also showed excellent agreement between the radiologists using CoLumbo and the majority opinion, which compares favorably with the limited agreement between radiologists reported in several inter-observer studies, with kappa varying between 0.26 and 0.65 (24-26). Moreover, a recent review of AI and CAD systems used for the diagnosis of low back pain reported on four AI studies for spinal stenosis (17), which demonstrated similar sensitivity, specificity, and accuracy. However, all of these were retrospective studies.

The reasons for the 156 debatable cases can be summarized as follows: (a) disagreements near the classification threshold (borderline cases); (b) stenosis at the vertebral body level; (c) a reduced sac due to anatomical reasons or variation; and (d) stenosis at the sacral level. Figures 5A and 5B show two borderline cases.
Specifically, Figure 5A shows a case with a Dural Sac cross-sectional area of 106 mm2, which can be considered borderline. This was a source of disagreement between the radiologist not using the software and the majority opinion. Such cases are a source of not only interrater but also intrarater disagreement. For the case shown in Figure 5B, the radiologist without the software reported no stenosis; the radiologist assisted by the software and the software alone (95 mm2) reported stenosis; and the arbiter radiologist reported the case as stenosis.

Figure 5. Borderline cases: Dural Sac cross-sectional area (A) 106 mm2; (B) 95 mm2.

Another disagreement case is shown in Figure 6, with a naturally reduced sac at the L4/L5 level. Here the AI algorithm reported central stenosis based on the calculated cross-sectional area of 63 mm2 at L4/L5; however, the majority of radiologists disagreed. The reason could be that the Dural Sac naturally terminates more cranially in this case. Interestingly, at L5/S1, even though the area is even smaller, the algorithm recognizes that the cross-sectional area is expected to be naturally small.

Figure 6. Naturally reduced sac at L4/L5 level.

Figure 7 shows central stenosis with a Dural Sac cross-sectional area over 100 mm2. This is a different type of disagreement: the cross-sectional area is more than 100 mm2, yet the unanimous opinion is that this is a case of central stenosis. The radiologist using the CoLumbo software corrected the algorithm's suggestion. This case supports the kappa results by showing qualitatively why a radiologist assisted by the software performs better than either the algorithm or the radiologist not using the software.

Figure 7. Central stenosis with Dural Sac cross-sectional area over 100 mm2.

Finally, Figure 8 shows central stenosis at the sacral level due to epidural lipomatosis.
In this case, the software does not report central stenosis at the sacral level, because the Dural Sac cross-sectional area can naturally decrease there. Stenosis is still possible at this level, but the criterion should not be 100 mm2. A physician can instead estimate how large the sac should be at the given level.

Figure 8. Central stenosis at sacral level due to epidural lipomatosis.

Limitations of this study. In this study, we used only MRI images of the patients. Other modalities, such as contrast-enhanced CT or electrical tests of muscle activity, were not used to validate the presence of stenosis. For the debatable cases, the ground truth was also based on MRI, and the arbiter was an MRI radiologist. Discrepancies could also be due to this lack of validation with other modalities. The fact that the data came from three different clinical sites across ***, acquired on different MRI systems, is another possible source of discrepancy.

V. Conclusion

This prospective study showed that the assessment of radiologists supported by a deep learning system for central stenosis classification results in high kappa agreement. Introducing such AI-based tools into practice would allow the presence of stenosis to be predicted precisely and would thus decrease observer variability in assessing lumbar spinal stenosis severity based on MRI and its relation to the cross-sectional spinal canal area. This would result in timely and effective surgical treatment and improved quality of life for these patients.