Abstract
Developmental dysplasia of the hip (DDH) is a condition in which the acetabular socket inadequately contains the femoral head (FH). If left untreated, DDH can result in degenerative changes in the hip joint. Several imaging techniques are used for DDH assessment. In radiographs, the acetabular index (ACIN), center-edge angle, Sharp's angle (SA), and migration percentage (MP) metrics are used to assess DDH. Determining these metrics is time-consuming and repetitive. This study uses a convolutional neural network (CNN) to identify radiographic measurements and improve traditional methods of identifying DDH. The dataset consisted of radiographs from 60 subjects, each rotated along the craniocaudal and mediolateral axes to 25 distinct orientations, generating 1500 images. A CNN detection algorithm was used to identify key radiographic metrics for the diagnosis of DDH. The algorithm was able to detect the metrics with reasonable accuracy in comparison to the manually computed metrics. The CNN performed well on images with high contrast margins between bone and soft tissues. In comparison, the CNN was not able to identify some critical points for metric calculation on a few images that had poor definition due to low contrast between bone and soft tissues. This study shows that CNNs can efficiently measure clinical parameters to assess DDH on radiographs with high contrast margins between bone and soft tissues with purposeful rotation away from an ideal image. Results from this study could help inform and broaden the existing bank of information on using CNNs for radiographic measurement and medical condition prediction.
1 Introduction
Developmental dysplasia of the hip (DDH) is widely known to be the most common etiology for the development of osteoarthritis of the hip. DDH occurs when the ball-and-socket hip joint is underdeveloped, in which the acetabulum (socket) is too shallow for the ball (femoral head) to be secure in the joint. This can lead to subluxation and, in more severe cases, complete hip joint dislocation [1]. Additionally, extraneous tension of connective tissue and tendons surrounding the joint can lead to long-term overcompensation during dislocation [2]. Current diagnostic imaging options include radiographs, ultrasound, computed tomography (CT), and magnetic resonance imaging [3,4]. The utility of imaging can be limited by factors such as interobserver variability, false positives and negatives, and limited reproducibility in follow-up examinations [2–4].
Machine learning has become an increasingly viable means to reduce human error in DDH diagnosis. Machine learning detection algorithms such as You Only Look Once (YOLO) train a neural network to process data and make predictions informed by human-labeled input [5]. This has the potential to limit user subjectivity, track developmental changes across age progression, and reduce false positives and negatives [6]. Several studies have used neural networks for DDH assessment in radiography [7–11]. One approach utilizes probability predictions through classic machine learning to identify DDH from two- and three-dimensional ultrasounds [12,13]. Other studies have shown improvements in efficiency by implementing a machine-learning network to assist in diagnosis [6,14,15]. However, these studies have not investigated the prediction accuracy on radiographs with poor definition due to low contrast between bone and soft tissues, nor have they used rotated pelvic images that were not perfectly aligned with anatomical planes.
The standard assessment metrics are designed to be computed on aligned images, so misaligned radiographs violate their underlying assumptions. Additionally, the computation is inherently limited because the pelvic structure is a complex three-dimensional shape represented by a two-dimensional image slice. Reference lines that are meant to be approximately horizontal on an aligned image can be significantly altered or distorted in misaligned images. This can adversely affect the standard deviation and variance of the computed assessment metrics. Few studies analyze misaligned images, but those that do show significant increases in measurement variation in angles such as the lateral center-edge angle and Sharp's angle [16]. This increase in variation can be attributed in part to the change in the reference lines, and the metrics can be further influenced by the obfuscation of overlapping features. Computing these metrics by hand compounds these limiting factors by adding significant risk of intra- and interobserver variability, which makes automating the assessment metrics an attractive solution.
Radiographs are typically used for children older than 6 months, and the ACIN, MP, lateral center-edge angle (CEA), and SA are used to assess DDH. The ACIN measures the lateral coverage of the FH by the acetabulum [17]. The MP measures the displacement of the femoral head relative to the center of the acetabulum [18]. Both the SA and CEA represent the acetabulum's depth and capacity to cover the FH [19].
This study prioritized the accuracy of the CEA due to its ability to account for variations in the shape and size of the FH and acetabulum compared to the SA. The CEA has a higher rate of reproducibility and is less affected by variations in patient positioning than the SA [20]. The CEA can also be used to monitor hip plasticity, that is, adaptive changes in the shape and position of the acetabulum with respect to the FH over time [20]. The CEA is a strong indicator for the assessment of DDH and is thus critical for a neural network to identify accurately. A second goal of the study was to determine how the neural network would handle rotated pelvic images that were not perfectly aligned with anatomical planes.
The detection of DDH using medical metrics can be straightforward for an experienced radiographer. However, computing these metrics is a time-consuming and repetitive process, which can be detrimental to consistent diagnosis. The purpose of this study was to use a neural network to predict DDH metrics in radiographic images and address the limitations in DDH assessment, providing tools for practitioners by increasing the accuracy of DDH diagnosis.
2 Material and Methods
A set of de-identified CT scans were collected from 60 subjects for use in this Institutional Review Board (IRB) approved study by Rainbow Babies and Children's Hospital and Case Western Reserve University under study number 20211382. The subject set included 30 males and 30 females, with three subjects at each year of age ranging from 8 to 17 years. CT scans with fractures, hip dysplasia, retained hardware, and known intravenous or oral contrast studies were excluded. All CT scans contained healthy osseous structures without apparent deformity.
The 60-subject CT scans were first converted into three-dimensional (3D) models using the hospital's Picture Archiving and Communication System (PACS). These 3D models were subsequently converted to two-dimensional simulated radiographs by reducing the observable slab length to reflect the natural opacity and viewing dimensions of radiographic imaging. This process is displayed in Fig. 1.
The base position of each CT scan was set by aligning the superior aspect of the femoral heads, rotating the pelvis to display symmetric obturator foramina, and centralizing the tip of the coccyx between the pubic tubercles. The simulated radiographs were then manipulated in 3D space to predetermined set points. Each subject's pelvis was rotated along the mediolateral and craniocaudal axes in set increments. There were five specific increments along each axis, leading to a total of 25 pelvic images per subject, as shown in Fig. 2.
Around the craniocaudal axis, the five positions were the coccyx centered between the pubic tubercles, the coccyx rotated to the medial edge of the obturator foramen (each side), and the coccyx rotated midway between these two points (each side). Around the mediolateral axis, the five positions were the tip of the coccyx centered between the pubic tubercles, the superior and inferior pubic rami superimposed, the distance midway between these two points, the pubic tubercles in line with the sacrococcygeal line, and midway between this point and the centered pubic tubercle point.
Images were subsequently saved in each position. Upon collecting the 25 radiographs, each image was adjusted for uniform brightness and contrast, and a standard sharpness of 35% was applied to allow for visualization of the radiographic landmarks used in measurements. Therefore, a total of 1500 images were available. Two hundred and fifty of the 1500 images were set aside as holdout testing samples to assess the trained algorithm. The remaining 1250 images were used to train, validate, and test the algorithm during development. Before training the network, the images were preprocessed to optimize the data.
2.1 Preprocessing Images.
The first step in this process was to resize the images, since many of them were not uniformly sized. Note that this study used MATLAB (MathWorks, Natick, MA) for all programming and for the neural network implementation. The images were resized to JPEG images 1564 pixels wide by 940 pixels tall with a resolution of 96 dots per inch. Additionally, the images contained extraneous data (e.g., patient information and camera zoom percentage) that was not part of the hip and could therefore be removed based on its consistent location within the image, as shown in Fig. 3. Although the removal process worked well for most of the images, a small number of processed images still contained a nominal amount of text. This was due to the initially inconsistent sizing, which, after resizing, stretched the extraneous text to locations inconsistent with the majority of the data samples. This nominal amount of text did not influence the algorithm's predictions.
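As a concrete illustration, this preprocessing step can be sketched in MATLAB as follows. The folder names and the exact blanked rectangles below are assumptions for illustration only; the true overlay regions were determined from the consistent placement of the text in the source images.

```matlab
% Preprocessing sketch (folder names and blanked regions are assumed
% for illustration, not the study's exact values).
files = dir(fullfile("raw_images", "*.jpg"));
for k = 1:numel(files)
    img = imread(fullfile(files(k).folder, files(k).name));
    img = imresize(img, [940 1564]);          % target size: 940 x 1564 px
    % Blank the overlay text (patient information, zoom percentage)
    % at its consistent corner locations.
    img(1:60, 1:400, :) = 0;                  % top-left block (assumed)
    img(end-59:end, end-299:end, :) = 0;      % bottom-right block (assumed)
    imwrite(img, fullfile("processed_images", files(k).name));
end
```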
2.2 Labeling Procedure.
The processed images were then labeled utilizing MATLAB R2022B's image labeling tool (image processing toolbox). The previously described metrics (ACIN, CEA, SA, MP) were analyzed, and unique locations were determined to be used to create the ground truth metrics [21]. These locations are the femoral head, the lateral acetabular roof, the triradiate cartilage, and the pelvic teardrop (Köhler teardrop) [22].
Example images of these points were created by medical specialists; the locations were labeled as 6, 1, 2, and 5, respectively, in Fig. 4(a). Note that 3 and 4 refer to the centers of the femoral heads and were computed from label 6. Figure 4(b) shows the replicated labels of the key locations; some of these locations are abbreviated as follows: Sourcil Sharps MP (SSMP), Sourcil Tönnis P2 (STP2), and FH. The locations shown in Fig. 4(b) were labeled throughout the training set of 1250 images, with the labeling process reviewed by medical experts. At the end of the procedure, the labeling was reviewed for consistency.
2.3 Algorithm Setup.
The labels were exported and subsequently split into a training and testing set, further subdivided as shown in Fig. 5. Additionally, to reduce computational demand, the images and corresponding label coordinates were uniformly shrunk by a factor of 4 to 391 × 234. Tiny-yolov4-coco was the base network for the algorithm; it possesses two detection heads and is pretrained on the COCO dataset. Csp-darknet53-coco was considered; however, despite being the standard base network for YOLOv4, its increased computational cost coupled with only marginal improvement in results prevented its use in this study [5]. Four anchor boxes were assigned per detection head. It is important to note that YOLOv4 requires images with pixel length and width that are multiples of 32. This necessitated a slight augmentation, performed by a MATLAB transform function, to increase the images and labels to the network input size of 416 × 256, which can introduce errors into the network. This increase in image and label size was strictly for training; the network was fed the 391 × 234 images for assessment purposes. The data were further augmented by flipping and randomly scaling the images to increase the available training data and improve algorithm accuracy. Color-change augmentation was not used because the images are monochromatic. An example of augmented data is shown in Fig. 6.
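A hedged sketch of this detector setup using MATLAB's Computer Vision Toolbox is shown below. The class list and the `trainingData` datastore name are illustrative assumptions; the base network name and the 416 × 256 input size follow the text above.

```matlab
% Detector setup sketch (class names and trainingData are assumed).
classes = ["FH" "SSMP" "STP2" "SO"];
inputSize = [256 416 3];                    % H x W x C, multiples of 32

% Estimate eight anchor boxes (four per detection head) from the
% labeled boxes and split them by area between the two heads.
anchors = estimateAnchorBoxes(trainingData, 8);
[~, idx] = sort(anchors(:,1) .* anchors(:,2), "descend");
anchorBoxes = {anchors(idx(1:4),:); anchors(idx(5:8),:)};

detector = yolov4ObjectDetector("tiny-yolov4-coco", classes, ...
    anchorBoxes, InputSize=inputSize);
```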
Table 1 depicts the pertinent training options used for the network. Three network optimizers were considered: stochastic gradient descent with momentum (SGDM), root-mean-squared propagation (RMSProp), and adaptive moment estimation (Adam). Adam was selected because it combines the adaptive learning rates of optimizers like RMSProp with the momentum-based gradient descent of optimizers such as SGDM [23]. Adam converged to excellent results without overtraining within 25 training epochs; note that the learning rate was unchanged between epochs, as the network converged appropriately, as illustrated in Fig. 7. The bounding box loss was trained using a mean square error loss function, and cross-entropy was used to calculate the classification loss. The network was trained in parallel on a multi-GPU setup with an RTX A4000 and an RTX 4000 (NVIDIA Corp., Santa Clara, CA).
Training setting | Setting chosen
---|---
Maximum epochs | 25
Learning rate | 0.001
Mini-batch size | 25
Batch normalization statistics | Moving
Network output | Best validation loss
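The options in Table 1 map directly onto MATLAB's `trainingOptions`. A sketch is given below, where `validationData` and the augmented training datastore name are assumptions.

```matlab
% Training options sketch matching Table 1 (datastore names assumed).
options = trainingOptions("adam", ...
    MaxEpochs=25, ...
    InitialLearnRate=0.001, ...
    MiniBatchSize=25, ...
    BatchNormalizationStatistics="moving", ...
    OutputNetwork="best-validation-loss", ...
    ValidationData=validationData, ...
    ExecutionEnvironment="multi-gpu");

trainedDetector = trainYOLOv4ObjectDetector(augmentedTrainingData, ...
    detector, options);
```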
2.4 Postprocessing Setup.
The raw network detections were postprocessed before the metrics were computed. First, the distance of each detection from the centroid of its class was calculated as

$$ d = \lVert x - c \rVert \tag{1} $$

where x is the specific observation and c is the centroid for Eq. (1). The detections were then combined into a single location per class using the weighted average

$$ \bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \tag{2} $$

where n is the number of observations, $w_i$ is the weight for the corresponding observation, and $x_i$ is the corresponding observation. The weighted average used the algorithm-computed scores as the weights and the center point coordinates as the observations. Finally, the last step of postprocessing is to check that a single observation exists on each side of the hip.
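A minimal sketch of this postprocessing for one class on one side of the hip is shown below; the variable names and the distance tolerance `distanceTol` are illustrative assumptions.

```matlab
% Postprocessing sketch (variable names and distanceTol are assumed).
% bboxes: n-by-4 [x y w h] detections; scores: n-by-1 confidences.
centers = [bboxes(:,1) + bboxes(:,3)/2, bboxes(:,2) + bboxes(:,4)/2];
c = mean(centers, 1);                       % class centroid, as in Eq. (1)
d = vecnorm(centers - c, 2, 2);             % distance of each detection
keep = d < distanceTol;                     % drop detections far from centroid
% Score-weighted average of the remaining centers, as in Eq. (2).
xbar = sum(centers(keep,:) .* scores(keep), 1) ./ sum(scores(keep));
```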
3 Results and Discussion
The network efficacy can be analyzed using a combination of machine learning- and result-based metrics. The machine learning metrics provide insight into the network's accuracy and show where training deficiencies relative to the labeling may occur. In contrast, comparing the results to expected values from the surrounding literature helps determine the effectiveness of the labeling and the network. Statistical analysis was also performed to assist in diagnosing sources of error within the network. Using both sets of metrics in tandem provides an overview of the general effectiveness of the chosen network and helps inform future work.
3.1 Machine Learning Metrics.
The trained network was assessed using precision (P), recall (R), and the F-score. Precision and recall are computed from the true positives (TP), false positives (FP), and false negatives (FN) as

$$ P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN} \tag{3} $$

and the F-score is the harmonic mean of the two,

$$ F = \frac{2PR}{P + R} \tag{4} $$

where P is precision and R is recall. The F-score is a generally accepted metric for analyzing networks, but it does possess limitations. Its primary limitation is that it does not use the true negative (TN) when computed from a confusion matrix. This makes it suitable for situations where distinguishing whether measurements are correctly identified in the appropriate class is not a priority. However, for situations such as facial recognition, predictive analytics, and medical diagnoses, it is less suitable to rely on the F-score, as accurately classifying the appropriate locations is a crucial component of ensuring that the algorithm functions properly [25].
3.2 Medical Metrics.
The results from the algorithm must be compared to known healthy metric values. The ranges for healthy radiographic metrics are not universally agreed upon, as the normal range for each metric changes with respect to numerous biological factors. In general, the range for dysplasia grows narrower as the age of the subject increases [26]. This general framework for analysis is tabulated in Table 2. Since the simulated radiographs are for healthy patients, the predicted metrics should ideally fall within these expected values with a very small tolerance. However, accounting for the rotated images requires additional tolerance to compensate for the degree of misalignment. In this case, recall that the network outputs were postprocessed utilizing the tolerances outlined in Sec. 2.4 and Table 3.
3.3 Statistical Analysis.
To compare the distributions of the ground truth and predicted metrics, each histogram was normalized as a probability density estimate,

$$ f_i = \frac{c_i}{N w_i} \tag{5} $$

where $c_i$ is the number of elements in the bin, N is the number of elements of the input data, and $w_i$ is the width of the bin.
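This is the same normalization applied by MATLAB's built-in "pdf" histogram option; a short sketch follows, with `metricValues` standing in for the computed angles of one metric.

```matlab
% Histogram normalized as a probability density estimate.
histogram(metricValues, Normalization="pdf");

% Equivalent manual computation of f_i = c_i / (N * w_i).
[c, edges] = histcounts(metricValues);      % bin counts c_i and bin edges
f = c ./ (numel(metricValues) .* diff(edges));
```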
3.4 Machine Learning Metric Outputs.
The YOLOv4 network training outputs were compared to ground truths using the equations discussed in Sec. 3.1 and subsequently tabulated in Table 4. For reference, the mean IoU of the network was 0.80. The precision and recall were calculated using the Computer Vision Toolbox function "evaluateDetectionPrecision." The IoU threshold for the function was set to 0.5, which required at least half of the box output from the network to overlap with the ground truth bounding box to count as a detection. As seen in Fig. 9, while most of the labels were acceptably accurate, the two most accurate labels were the femoral head and the lateral acetabular roof. This is excellent, as those two locations are critical to measuring the center-edge angle. The average precision and F-measure for those classes were also significantly higher than for the remaining classes, as seen in Table 4.
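A sketch of this evaluation call is shown below; the test datastore names are assumptions.

```matlab
% Evaluation sketch (testImages and testLabels are assumed datastores).
detectionResults = detect(trainedDetector, testImages);
[ap, recall, precision] = evaluateDetectionPrecision( ...
    detectionResults, testLabels, 0.5);     % IoU threshold of 0.5
```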
Class | Avg. precision | Avg. F-measure
---|---|---
FH | 0.9743 | 0.6899
SSMP | 0.9291 | 0.6651
STP2 | 0.6105 | 0.5117
SO | 0.5422 | 0.4913
This discrepancy can be attributed to the fact that the other four labels have a large degree of variability in shape, size, and location; due to this wide range, the algorithm struggles to identify those points correctly. To rectify this, additional data could be used to improve training for these classes. Alternatively, these locations may require a different algorithm type, such as a segmentation algorithm, to be identified correctly.
3.5 Machine Learning Image Evaluation.
The raw ground truth values were used to judge the algorithm's accuracy. This was done by feeding the label locations into a MATLAB program with functions that calculate the angles for both hips. Additionally, code was developed to draw the calculation lines on the image. Figure 10 shows an example output calculating the migration percentage. Note that the left and right hips are identified from the subject's perspective, not the viewing perspective. The metrics associated with Fig. 10 are listed in Table 5.
Metric | Ground truth (left hip) | Ground truth (right hip) | High-quality image (left hip) | High-quality image (right hip)
---|---|---|---|---
SA | 40.01° | 34.68° | 34.28° | 36.37°
CEA | 32.38° | 41.91° | 37.76° | 40.95°
ACIN | 8.42° | 8.62° | 6.93° | 10.87°
MP | 17.17% | 2.87% | 14.61% | 9.18%
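For reference, the geometry behind one of these angle calculations can be sketched as follows. This is an illustrative CEA computation under the orthogonality convention described in Secs. 3.5 and 3.6; the point names are assumptions, not the study's exact code.

```matlab
% Illustrative CEA calculation from labeled points (names assumed).
% tearL, tearR: Köhler teardrop centers; fhCenter: femoral head center;
% roof: lateral acetabular roof point (all 1-by-2 [x y] image points).
ref = (tearR - tearL) / norm(tearR - tearL);    % teardrop reference line
vert = [-ref(2), ref(1)];                       % orthogonal (vertical) axis
v = (roof - fhCenter) / norm(roof - fhCenter);  % FH center to lateral roof
cea = acosd(abs(dot(v, vert)));                 % angle from vertical, deg
```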
The ground truth calculations from Table 5 are compared to the machine learning algorithm predictions to determine the degree of variation in outputs and to help characterize the effectiveness of the network. Figure 11 depicts the same image as Fig. 10 using the machine learning outputs, and Table 5 shows the calculated values based on those outputs. As a comparison of Figs. 10 and 11 shows, the outputs are close to the ground truth data. Note that the shrunken ground truth images (391 × 234) were used as inputs, and thus the bounding boxes had to be resized by a factor of 4 to calculate and display the metrics on the higher-quality images. For this study, radiographs with high contrast margins between bone and soft tissue were defined as high-quality images, while radiographs with low contrast margins were defined as low-quality images. While the network produced reasonably accurate results for high-quality radiographs, it struggled to find certain points on low-quality radiographs. This is reflected in Fig. 12 and supported by the metric calculations in Table 6. These differences are unacceptably high and are heavily influenced by the pelvic teardrop locations being misidentified, as well as by small shifts in the locations of the femoral head of the left hip. This shows that while the algorithm can locate the general points for analysis, it will need refinement to identify the metrics properly on rotated images.
Metric | Ground truth (left hip) | Ground truth (right hip) | Low-quality image (left hip) | Low-quality image (right hip)
---|---|---|---|---
SA | 38.78° | 51.92° | 28.55° | 56.68°
CEA | 45.33° | 37.40° | 50.12° | 32.76°
ACIN | 4.59° | 1.28° | 5° | 2.38°
MP | 6.94% | 9.86% | 4.90% | 13.52%
In some cases, the network has difficulty distinguishing critical locations from background noise. As a result, it cannot find some of the requisite prediction locations, such as the acetabular teardrop and the medial head of the acetabulum, as illustrated in Fig. 13. This, in turn, prevents most of the metrics from being calculated, since the teardrop reference line cannot be constructed. The teardrop reference line is used as the orthogonal reference to draw the vertical lines for the CEA and MP calculations.
Overall, the network can consistently identify key locations on high-quality, nonrotated grayscale images. This is of key importance, as it demonstrates that it is possible to automate metric analysis on misaligned radiographs using a traditional computer vision detection approach. While limitations exist, such as when the radiograph is rotated significantly, features are obscured, or contrast is poor, the network will still converge to a solution for most of the requisite outputs. Steps that can be taken to help overcome these limitations are outlined in Sec. 3.7.
3.6 Statistical Analysis Results.
In order to provide the best assessment of the trends, it is necessary to remove outliers from both the ground truth and the fully processed network outputs. However, using the recommended values for healthy metrics from Table 2 does not account for the fact that rotated images can be accurately calculated to have larger angles due to the increase in variation from misalignment [29]. Most of the metrics are computed with respect to a teardrop reference line drawn between the two Köhler teardrops, and the vertical lines used to compute factors such as the CEA and MP are defined to be orthogonal to that reference line. As such, if the Köhler teardrops are rotated in a misaligned image, the variation in the results can increase significantly. Taking these factors into account, along with referencing relevant articles for physically measured values, yields the outlier limit table shown in Table 3. All values computed to be greater than these limits were ignored.
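A minimal sketch of this outlier filtering follows, with an assumed limit standing in for the Table 3 values, which are not reproduced here.

```matlab
% Outlier-removal sketch (ceaLimit is an assumed stand-in for Table 3).
ceaLimit = 60;                              % upper limit in degrees (assumed)
ceaValues = ceaValues(ceaValues <= ceaLimit);
```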
The standard deviations and Z-scores of the four metrics were compared between the ground truth and the network predictions. For the standard deviation, as shown in Fig. 14, the network prediction variance for each metric stayed within the boundaries of the manually labeled ground truth data. In particular, the network predictions possessed less variance for the acetabular index than the ground truth. The network did not increase the variance of the predictions in comparison to the ground truth data, which shows that the network converged to a solution and alleviates concerns that the network was overtrained.
The Z-scores can be seen in Fig. 15, which shows an excellent correlation between the true measures and the estimated ones. Note that the standard deviations used to compute the Z-scores are the hand-labeled standard deviations for the corresponding image sets, which relates the network observations back to the ground truth, while the arithmetic means were computed from the prediction sets. The Z-scores were plotted up to five standard deviations from the mean; however, the majority remained within three standard deviations, and most of the observations lay within one standard deviation of the arithmetic mean. Given the high precision shown in Fig. 14, this means that the majority of the labeling was consistent between the hand labels and the network predictions.
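A short sketch of this Z-score convention, with assumed vector names:

```matlab
% Z-scores: prediction-set mean, ground-truth standard deviation.
mu = mean(predValues);                      % arithmetic mean of predictions
sigmaGT = std(gtValues);                    % hand-labeled standard deviation
z = (predValues - mu) / sigmaGT;
```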
3.7 Limitations.
The current limitations of the network stem from its inability to accurately locate some of the labels. Additionally, while 1500 images may appear to be a large number, it is a relatively small dataset by machine learning standards. When broken down, it permits only 60 images of each rotation, which may be insufficient to train the network to locate the less distinct points, such as the pelvic teardrop. There are also concerns stemming from the fact that most detection algorithms were designed to identify color photographs rather than monochrome images; the presence of color assists in delineating key features. Overcoming these limitations will require modifying the approach for calculating the metrics and potentially replacing or shifting the network layers and learning scheme. This network utilized a transfer learning scheme, in which a pretrained network was loaded and subsequently retrained. Replacing the pretrained network with one better suited to medical imagery may improve results. Alternatively, fully retraining the weights in all layers of the network could potentially improve results. It may also become necessary to redefine the method of computing the assessment metrics if the orthogonality condition yields inaccurate results.
4 Conclusion
The goal of this study was to use a neural network to predict DDH metrics in radiographic images and address the limitations of DDH assessment. One key result is that the network responses are precise and statistically correlated with the ground truth labeling, as shown by the standard deviation and Z-score analyses. Additionally, it was found that image quality plays a key role in whether the network is capable of predicting the locations required to compute the medical diagnostic metrics. High-contrast images rotated along a single axis produced accurate results with predictions that converged. Lower-contrast images rotated along multiple axes either produced unsuitable results or were not capable of producing predictions. The network had significant difficulty locating the pelvic teardrop and the medial head of the acetabulum, which affects the measurements that rely on those locations. Overcoming these limitations, by performing the refinements outlined in Sec. 3.7, will be required to proceed to the next phase of this study.
The applications of this neural network, once refined, can be extended to investigating radiographs where information is missing or corrupted, such as hemipelvic radiographs. Analyzing and correctly quantifying the metrics on datasets with nonideal or omitted information is of significant value. A key point to note is that the radiographic measurements in this initial investigation were computed directly on the rotated images; no attempt was made to correct the metrics to the values they would take on an aligned radiograph. A future goal is to have a machine learning network learn to identify the bias in the values induced by rotation relative to the aligned image frame. This type of task is not easy for a human to perform, but a machine could potentially do so. The ability to automatically identify radiographic metrics on longitudinal data is of extreme interest for understanding conditions that affect hip morphology and growth, which is vital to treating conditions such as DDH.
Acknowledgment
This study was supported by the International Hip Dysplasia Institute (IHDI). The opinions, findings, and conclusions or recommendations expressed are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Funding Data
US National Science Foundation CAREER (Award ID: CMMI-2238859; Funder ID: 10.13039/100000001).
Data Availability Statement
The datasets generated and supporting the findings of this article are available from the corresponding author upon reasonable request.
Nomenclature
- ACIN =
acetabular index
- CEA =
lateral center-edge angle
- CT =
computed tomography
- CNN =
convolutional neural network
- DDH =
developmental dysplasia of the hip
- FH =
femoral head
- FN =
false negative
- FP =
false positive
- IoU =
intersection over union
- MP =
migration percentage
- P =
precision
- R =
recall
- RMSProp =
root-mean-squared propagation
- SA =
Sharp's angle
- SGDM =
stochastic gradient descent with momentum
- SSMP =
sourcil Sharps migration percentage
- STP2 =
sourcil Tönnis P2
- TN =
true negative
- TP =
true positive
- YOLO =
you only look once