The widespread application of high-performance casting parts in industries such as automotive and electronics places stringent demands on their internal quality. Defects like shrinkage porosity, gas holes, inclusions, and cracks, formed during the casting process, can significantly compromise the structural integrity and performance of the final casting part. X-ray non-destructive testing (NDT) has become a primary method for inspecting the internal structure of these components due to its penetration capability and non-invasive nature. However, manual inspection of X-ray images is labor-intensive, subjective, and inefficient, especially for small, low-contrast, or irregularly shaped defects inherent to complex casting parts.
While traditional machine vision methods relying on handcrafted features have been explored, they often lack robustness and generalizability across different casting part geometries and defect types. The advent of deep learning has revolutionized automated visual inspection. Two-stage detectors like Faster R-CNN offer high accuracy but at a computational cost unsuitable for real-time deployment. Single-stage detectors like the YOLO series provide an excellent speed-accuracy trade-off, making them attractive for industrial applications. Nevertheless, directly applying standard models to casting part defect detection faces significant challenges: the features of small defects are easily lost in deep networks; weak contrast between the defect and the surrounding material in the casting part complicates feature extraction; and the multi-scale nature of defects—from tiny pores to large shrinkage cavities—requires sophisticated feature fusion strategies.
To address these challenges specific to casting part inspection, this paper proposes an enhanced object detection framework named X-YOLOv5s. The core improvements focus on strengthening feature representation and fusion capabilities to better handle the unique difficulties presented by casting part X-ray imagery.

Enhanced Model Architecture
The proposed X-YOLOv5s is built upon the efficient YOLOv5s backbone. The overall architecture integrates several key modifications designed to boost performance for casting part defect detection, as summarized in the following functional table:
| Network Component | Base Model (YOLOv5s) | Proposed Model (X-YOLOv5s) | Primary Benefit for Casting Part Inspection |
|---|---|---|---|
| Attention Module | None | NCA (New Coordinate Attention) | Enhances focus on small, irregular defect regions within the casting part. |
| Neck Structure | PANet | BiFPN (Bidirectional Feature Pyramid Network) | Improves multi-scale feature fusion for defects of varying sizes in the casting part. |
| Regression Loss | CIoU Loss | EIoU Loss | Increases bounding box localization accuracy for precisely defining defect boundaries in the casting part. |
| Input Processing | Mosaic Augmentation | Mosaic + Histogram Equalization & Gamma Transform | Improves model robustness to contrast variations common in X-ray images of casting parts. |
1. NCA: New Coordinate Attention for Casting Defects
To direct the network’s focus toward critical but subtle defect features in a casting part, an improved attention mechanism is integrated. The standard Coordinate Attention (CA) module captures long-range dependencies along spatial dimensions. However, for the nuanced task of identifying faint defects in a casting part, a more discriminative mechanism is beneficial. We propose the NCA module, which incorporates a channel attention sub-module based on the Normalization-Based Attention Module (NAM). This addition allows the network to suppress less significant channel features, which is crucial when defect signals are weak relative to the overall casting part structure.
The NCA module operates as follows. For an input feature map $\mathbf{X} \in \mathbb{R}^{C \times H \times W}$, it first performs coordinate information embedding using 1D global pooling along the height and width axes, yielding feature maps $\mathbf{z}^h \in \mathbb{R}^{C \times H \times 1}$ and $\mathbf{z}^w \in \mathbb{R}^{C \times 1 \times W}$. These are then concatenated and transformed via a shared $1 \times 1$ convolution, followed by batch normalization (BN). The scaling factors from BN, denoted as $\gamma$, are used in the NAM-inspired channel weighting. The intermediate feature map $\mathbf{f}$ is processed as:
$$ \mathbf{f} = \delta( F_1( [\mathbf{z}^h, \mathbf{z}^w] ) ), \quad \text{where } \mathbf{f} \in \mathbb{R}^{C/r \times 1 \times (H+W)} $$
Here, $\delta$ is the HardSwish activation function, chosen for its efficiency and gradient properties, defined as:
$$ \text{HardSwish}(x) = x \cdot \frac{\text{ReLU6}(x+3)}{6} $$
The feature map $\mathbf{f}$ is then split back into $\mathbf{f}^h$ and $\mathbf{f}^w$. After separate $1 \times 1$ convolutions and sigmoid activations $\sigma$, the final channel-recalibrated and spatially-aware attention weights $\mathbf{g}^h$ and $\mathbf{g}^w$ are generated. The final output $\mathbf{Y}$ of the NCA module is:
$$ \mathbf{Y} = \mathbf{X} \odot (\mathbf{g}^w \otimes \mathbf{g}^h) \odot \sigma(\gamma \cdot \text{GAP}(\mathbf{X})) $$
where $\odot$ is element-wise multiplication, $\otimes$ is outer product, and GAP is Global Average Pooling. This structure enables the model to pay more precise attention to the often irregular and low-contrast defect areas within a casting part.
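To make the pooling-and-recombination structure of NCA concrete, here is a minimal NumPy sketch. This is an illustrative approximation, not the paper's implementation: the $1 \times 1$ convolutions and channel reduction are replaced by identity stand-ins, and the BN scaling factors $\gamma$ are fixed to ones, so only the coordinate pooling, the outer-product spatial attention, and the NAM-style channel weighting are shown.

```python
import numpy as np

def hard_swish(x):
    # HardSwish(x) = x * ReLU6(x + 3) / 6
    return x * np.clip(x + 3.0, 0.0, 6.0) / 6.0

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nca_sketch(x):
    """x: (C, H, W) feature map. The 1x1 convolutions of the real module
    are replaced by identity maps here, so only the pooling and
    recombination structure is illustrated."""
    C, H, W = x.shape
    z_h = x.mean(axis=2, keepdims=True)            # (C, H, 1): pool along W
    z_w = x.mean(axis=1, keepdims=True)            # (C, 1, W): pool along H
    # Concatenate along the spatial axis, then apply the activation
    f = hard_swish(np.concatenate([z_h, z_w.transpose(0, 2, 1)], axis=1))
    f_h = f[:, :H, :]                              # (C, H, 1)
    f_w = f[:, H:, :].transpose(0, 2, 1)           # (C, 1, W)
    g_h, g_w = sigmoid(f_h), sigmoid(f_w)          # spatial attention weights
    gamma = np.ones(C)                             # stand-in for BN scale factors
    chan = sigmoid(gamma * x.mean(axis=(1, 2)))    # NAM-style channel weights
    # Broadcasting g_h * g_w realizes the outer product over (H, W)
    return x * (g_h * g_w) * chan[:, None, None]
```

Because every attention weight is a sigmoid output in $(0, 1)$, the module can only rescale, never amplify, each activation, which is how less significant channels are suppressed.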
2. BiFPN for Multi-Scale Feature Fusion in Casting Parts
The neck of the network is responsible for fusing features extracted from different layers of the backbone, which correspond to different semantic levels and receptive fields. For a casting part containing defects at vastly different scales, effective multi-scale fusion is paramount. We replace the original PANet with the more efficient Bidirectional Feature Pyramid Network (BiFPN). BiFPN introduces learnable weights to balance the contribution of different input features during fusion and employs a simplified bidirectional cross-scale connectivity. This design lets information flow both top-down and bottom-up, and the bidirectional fusion can be repeated, yielding richer feature representations. The weighted fusion for a node can be expressed as:
$$ \mathbf{P}_{i}^{out} = \text{Conv}\left( \frac{\sum_{j} w_{ij} \cdot \mathbf{P}_{i}^{in_j}}{\sum_{j} w_{ij} + \epsilon} \right) $$
where $\mathbf{P}_{i}^{in_j}$ are the input features to node $i$, $w_{ij}$ are learnable weights for each input, and $\epsilon$ is a small constant for numerical stability. This mechanism allows the network to dynamically emphasize the most informative features for detecting both large and minute defects in the casting part.
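The fast normalized fusion above can be sketched in plain Python. Feature maps are stood in by flat lists of equal length, and the follow-up convolution and resizing steps of a real BiFPN node are omitted.

```python
def weighted_fusion(features, weights, eps=1e-4):
    """Fast normalized fusion: out = sum_j(w_j * P_j) / (sum_j(w_j) + eps),
    with weights kept non-negative (via ReLU clipping) so the normalized
    coefficients stay in [0, 1]."""
    w = [max(0.0, wj) for wj in weights]
    denom = sum(w) + eps
    return [sum(wj * f[i] for wj, f in zip(w, features)) / denom
            for i in range(len(features[0]))]
```

During training the weights $w_{ij}$ are learned, so the network can drive a weight toward zero to ignore an uninformative input scale at a given node.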
3. EIoU Loss for Precise Defect Localization
Accurate bounding box regression is essential for precisely locating the defect area within the casting part. The CIoU loss, used in the original YOLOv5, considers the overlap area, center point distance, and aspect ratio. However, its aspect ratio term does not accurately reflect the true differences in width and height. We adopt the EIoU Loss, which directly minimizes the differences in width and height between the predicted box and the target box, leading to faster and more accurate convergence. The EIoU Loss is defined as:
$$ \mathcal{L}_{EIoU} = \mathcal{L}_{IoU} + \mathcal{L}_{dis} + \mathcal{L}_{asp} = 1 - IoU + \frac{\rho^2(\mathbf{b}, \mathbf{b}^{gt})}{c^2} + \frac{\rho^2(w, w^{gt})}{C_w^2} + \frac{\rho^2(h, h^{gt})}{C_h^2} $$
where:
- $IoU$ is the intersection over union of the predicted and ground-truth boxes;
- $\rho^2(\mathbf{b}, \mathbf{b}^{gt})$ is the squared Euclidean distance between the predicted center $\mathbf{b}$ and the ground-truth center $\mathbf{b}^{gt}$;
- $c$ is the diagonal length of the smallest enclosing box covering both predicted and ground-truth boxes;
- $\rho^2(w, w^{gt})$ and $\rho^2(h, h^{gt})$ are the squared differences in width and height;
- $C_w$ and $C_h$ are the width and height of the smallest enclosing box.
This decomposition allows the regression loss to optimize the width and height discrepancies independently, which is particularly beneficial for accurately framing elongated cracks or irregular-shaped porosity in a casting part.
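As a concrete check of the decomposition, the loss can be computed for two axis-aligned boxes in plain Python. Boxes are given as corner coordinates `(x1, y1, x2, y2)`; degenerate zero-area enclosing boxes are not handled in this sketch.

```python
def eiou_loss(pred, gt):
    """EIoU loss between two boxes, each given as (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    # Intersection over union
    inter = (max(0.0, min(px2, gx2) - max(px1, gx1)) *
             max(0.0, min(py2, gy2) - max(py1, gy1)))
    area_p = (px2 - px1) * (py2 - py1)
    area_g = (gx2 - gx1) * (gy2 - gy1)
    iou = inter / (area_p + area_g - inter)
    # Smallest enclosing box and its squared diagonal c^2
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    c2 = cw ** 2 + ch ** 2
    # Squared center distance, normalized by c^2
    pbx, pby = (px1 + px2) / 2, (py1 + py2) / 2
    gbx, gby = (gx1 + gx2) / 2, (gy1 + gy2) / 2
    dist = ((pbx - gbx) ** 2 + (pby - gby) ** 2) / c2
    # Width/height terms, each normalized by the enclosing-box dimension
    pw, ph = px2 - px1, py2 - py1
    gw, gh = gx2 - gx1, gy2 - gy1
    asp = (pw - gw) ** 2 / cw ** 2 + (ph - gh) ** 2 / ch ** 2
    return 1.0 - iou + dist + asp
```

For identical boxes all three terms vanish and the loss is zero; unlike CIoU's coupled aspect-ratio term, the width and height errors here contribute independent, directly minimizable penalties.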
Experimental Setup and Data Preparation
The dataset consists of X-ray images collected from the production line of an aluminum alloy foundry. The images feature various casting parts, including automotive housings and communication device bodies. Defects were annotated by experts into four categories: Shrinkage Porosity, Gas Hole, Inclusion, and Crack. The dataset was partitioned and enhanced to improve model generalization for casting part inspection, as detailed below.
| Data Processing Step | Description | Purpose for Casting Part Model |
|---|---|---|
| Original Collection | 1,374 X-ray images (1000×1000 pixels) of defective casting parts. | Provides real-world, industrial-grade data. |
| Annotation | Bounding box and class label for each defect instance. | Supervised learning ground truth. |
| Augmentation | Mosaic, random flipping, scaling, HSV jitter. | Increases data diversity and model robustness. |
| Contrast Enhancement | Histogram Equalization & Gamma Correction applied to a subset. | Simulates varying X-ray exposure, helps model learn from weak-contrast defects in casting parts. |
| Train/Val/Test Split | Ratio of 8:1:1 (random split). | Standard practice for training and evaluation. |
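The contrast-enhancement step combines two standard point operations. The sketch below gives their textbook forms in plain Python, operating on flat lists of 8-bit gray values; it illustrates the transforms, not the paper's exact preprocessing pipeline.

```python
def gamma_transform(pixels, gamma):
    """Gamma correction on 8-bit gray values: out = 255 * (p/255)^gamma.
    gamma < 1 brightens dark (under-exposed) regions; gamma > 1 darkens
    bright ones."""
    return [round(255.0 * (p / 255.0) ** gamma) for p in pixels]

def histogram_equalize(pixels):
    """Classic histogram equalization via the cumulative distribution
    function, spreading concentrated gray levels over the full range."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    cdf, total = [], 0
    for h in hist:
        total += h
        cdf.append(total)
    cdf_min = next(c for c in cdf if c > 0)
    n = len(pixels)
    if n == cdf_min:  # constant image: nothing to equalize
        return list(pixels)
    return [round((cdf[p] - cdf_min) / (n - cdf_min) * 255) for p in pixels]
```

Applying these transforms to a subset of the training images exposes the model to the exposure and contrast variation it will encounter across different casting-part thicknesses.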
The final distribution of defect instances across the splits is shown in the following table, highlighting the prevalence of shrinkage porosity, a common defect in casting parts.
| Defect Class | Training Set | Validation Set | Test Set | Total |
|---|---|---|---|---|
| Shrinkage Porosity | 3786 | 529 | 663 | 4978 |
| Gas Hole | 1510 | 243 | 207 | 1960 |
| Inclusion | 547 | 97 | 106 | 750 |
| Crack | 375 | 39 | 44 | 458 |
The models were trained using Stochastic Gradient Descent (SGD) with an initial learning rate of 0.01, momentum of 0.937, and weight decay of 0.0005. The input image size was standardized to 640×640 pixels. The performance was evaluated using standard metrics: Precision (P), Recall (R), the harmonic mean F1-Score, and the mean Average Precision at an Intersection over Union (IoU) threshold of 0.5 (mAP@0.5). The frames per second (FPS) were measured to assess inference speed. The formulas for key metrics are:
$$ \text{Precision} (P) = \frac{TP}{TP + FP}, \quad \text{Recall} (R) = \frac{TP}{TP + FN} $$
$$ F_1 = 2 \cdot \frac{P \cdot R}{P + R}, \quad mAP@0.5 = \frac{1}{N_C} \sum_{c=1}^{N_C} AP_c(0.5) $$
where $TP$, $FP$, and $FN$ are true positives, false positives, and false negatives, respectively, and $AP_c$ is the Average Precision for class $c$.
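These metric definitions can be verified with a short plain-Python helper; the mAP example in the test simply averages the four per-class AP values reported for X-YOLOv5s in the ablation table, which recovers the reported 93.89% mAP@0.5.

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, and F1 from raw detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def mean_ap(ap_per_class):
    """mAP@0.5: the mean of per-class AP values at IoU threshold 0.5."""
    return sum(ap_per_class) / len(ap_per_class)
```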
Results and Analysis
Ablation Study
Ablation experiments were conducted to validate the contribution of each proposed component to the detection performance on casting part defects. The baseline YOLOv5s model achieved an mAP@0.5 of 91.11%. The incremental addition of the NCA module, BiFPN, and EIoU loss each brought improvements. The complete X-YOLOv5s model, integrating all three components, attained the highest mAP@0.5 of 93.89%, representing a significant gain of 2.78 percentage points over the baseline. The results confirm that each modification effectively addresses specific challenges in casting part defect detection. The detailed per-class results are presented in the following table:
| Model Configuration | mAP@0.5 (%) | Shrinkage (%) | Gas Hole (%) | Inclusion (%) | Crack (%) |
|---|---|---|---|---|---|
| YOLOv5s (Baseline) | 91.11 | 94.18 | 92.64 | 90.15 | 87.47 |
| + NCA | 93.23 | 97.04 | 94.35 | 92.83 | 88.70 |
| + BiFPN | 92.65 | 95.98 | 94.11 | 92.56 | 87.95 |
| + EIoU Loss | 91.42 | 94.53 | 93.24 | 90.33 | 87.58 |
| X-YOLOv5s (Full) | 93.89 | 97.43 | 95.12 | 93.64 | 89.37 |
Comparative Analysis with State-of-the-Art Models
The proposed X-YOLOv5s model was compared against a range of popular object detection models to evaluate its comprehensive performance on the casting part defect dataset. The comparison includes two-stage detectors (Faster R-CNN, Cascade R-CNN), lightweight models (YOLOv4-tiny), and recent high-performance single-stage models (YOLOx-s, YOLOv7, YOLOv8s). The results, summarized in the table below, demonstrate the superiority of the proposed method.
| Model | Params (M) | GFLOPs | Recall (%) | Precision (%) | F1-Score (%) | FPS | mAP@0.5 (%) |
|---|---|---|---|---|---|---|---|
| Faster R-CNN | 60.2 | 523.9 | 62.45 | 64.32 | 63.37 | 2.6 | 62.16 |
| Cascade R-CNN | 88.0 | 543.7 | 63.16 | 66.32 | 64.71 | 2.1 | 65.29 |
| YOLOv4-tiny | 5.9 | 16.2 | 80.18 | 82.57 | 79.36 | 7.1 | 74.46 |
| YOLOx-s | 9.0 | 26.8 | 86.76 | 88.61 | 87.68 | 3.8 | 88.40 |
| YOLOv7 | 37.3 | 105.3 | 89.98 | 92.30 | 91.13 | 10.3 | 92.03 |
| YOLOv8s | 11.1 | 28.8 | 90.03 | 92.69 | 91.34 | 9.3 | 92.32 |
| YOLOv5s (Baseline) | 7.0 | 16.1 | 88.43 | 91.93 | 90.15 | 8.9 | 91.11 |
| X-YOLOv5s (Proposed) | 9.5 | 18.8 | 89.75 | 93.34 | 91.50 | 8.1 | 93.89 |
Key observations from the comparison are:
1. Superior Accuracy: The proposed X-YOLOv5s achieves the highest mAP@0.5 (93.89%) and Precision (93.34%) among all compared models, indicating its exceptional ability to accurately identify and localize defects in casting parts with minimal false positives.
2. Balanced Performance: The proposed model also attains the highest F1-Score (91.50%), edging out YOLOv7 (91.13%) and YOLOv8s (91.34%) while requiring significantly fewer parameters and less computation (GFLOPs). This highlights its efficiency.
3. Practical Inference Speed: With an inference speed of 8.1 FPS on the given hardware, the model is sufficiently fast for many industrial inspection scenarios involving casting parts, where throughput is often measured in seconds per part rather than frames per second.
4. Efficiency Advantage: Compared to its direct baseline YOLOv5s, X-YOLOv5s delivers a substantial 2.78-percentage-point gain in mAP@0.5 with only a modest increase in parameters (from 7.0M to 9.5M) and a minor reduction in FPS (from 8.9 to 8.1). This represents an excellent trade-off, making it highly suitable for practical deployment in casting part quality control.
Conclusion
This work presents a robust deep learning-based framework, X-YOLOv5s, specifically tailored for the automated detection of internal defects in casting parts using X-ray imagery. The core challenges of small defect size, weak contrast, and irregular morphology inherent to casting part inspection are addressed through three synergistic enhancements: the integration of an NCA attention module to refine feature focus on defect regions; the adoption of a BiFPN structure for superior multi-scale feature fusion; and the application of EIoU Loss for precise bounding box regression. Comprehensive experiments on a real-world industrial dataset demonstrate that the proposed model outperforms existing state-of-the-art detectors in terms of detection accuracy (mAP@0.5 of 93.89%) while maintaining a practical inference speed. The ablation study confirms the individual contribution of each component. Therefore, the X-YOLOv5s model offers a reliable and efficient solution for enhancing quality assurance in the casting industry, capable of identifying critical internal flaws in casting parts that might otherwise escape manual inspection.
