In industrial manufacturing, the detection of metal casting defects is critical for ensuring product quality and safety. As a key component in automotive systems, aluminum alloy wheels must undergo rigorous inspection to identify internal flaws such as porosity, shrinkage, cracks, and micro-porosity. Traditional methods for detecting these metal casting defects often rely on manual examination of X-ray images, which is time-consuming, labor-intensive, and prone to human error. With advancements in deep learning, automated defect detection systems have emerged as a promising solution. In this study, I propose an optimized YOLOv8-based algorithm to enhance the detection of metal casting defects in aluminum alloy wheels, addressing the low efficiency, limited accuracy, and weak recognition capability of previous approaches.
The core of my method involves several key improvements to the YOLOv8 architecture. First, I integrate a Focus layer into the backbone network to perform slicing operations, which increases the channel depth while preventing information loss, thereby improving the global receptive field. This is particularly important for capturing small and irregular metal casting defects that may be obscured by complex wheel structures. Second, I replace the spatial pyramid pooling module with a simplified version, SimSPPF, to reduce computational complexity without sacrificing performance. Finally, I incorporate the BoTNet module into the backbone, which leverages multi-head self-attention mechanisms to enhance feature extraction and reduce parameters. These modifications collectively improve the model’s ability to localize and classify various metal casting defects in real-time industrial settings.
To validate the effectiveness of my approach, I conducted extensive experiments on a custom dataset comprising X-ray images of aluminum alloy wheels. The dataset includes 2,154 images resized to 640×640 pixels, annotated with four types of metal casting defects: porosity, shrinkage, cracks, and micro-porosity. The experimental environment was set up with an NVIDIA GeForce RTX 2080 Ti GPU, and the model was trained for 500 epochs using the Adam optimizer with a learning rate of 0.01. Evaluation metrics such as mean average precision (mAP), recall, precision, and frames per second (FPS) were used to assess performance. The results demonstrate that my optimized algorithm achieves a mAP of 98.80% and an average detection speed of 75.53 FPS, outperforming existing models like YOLOv5, YOLOv7, and YOLOv8x in detecting metal casting defects.
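For reference, a comparable baseline run could be configured with the stock Ultralytics training API roughly as sketched below; the dataset YAML path and class list are hypothetical placeholders, and the modified backbone described in the following sections would require a custom model definition rather than the stock `yolov8x.yaml`.

```python
# Hedged sketch of a baseline training run with the Ultralytics API.
# "wheel_defects.yaml" is a hypothetical dataset config listing the four defect classes.
from ultralytics import YOLO

model = YOLO("yolov8x.yaml")  # stock architecture; the proposed modules need a custom model YAML
model.train(
    data="wheel_defects.yaml",
    imgsz=640,         # images resized to 640x640
    epochs=500,
    batch=16,
    optimizer="Adam",
    lr0=0.01,          # initial learning rate
    cos_lr=True,       # cosine annealing schedule, as used in the training setup
)
```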
The problem of detecting metal casting defects in aluminum alloy wheels is complex due to the variability in defect shapes, sizes, and locations. Common metal casting defects, such as porosity caused by gas entrapment during solidification or shrinkage resulting from uneven cooling, can compromise wheel integrity. X-ray imaging is widely used for non-destructive testing, but manual inspection is inefficient. Previous deep learning methods, including Faster R-CNN and Mask R-CNN, have shown promise but often suffer from high computational costs or limited accuracy. My approach builds on YOLOv8, a state-of-the-art object detection model known for its speed and accuracy, by introducing enhancements tailored to the challenges of metal casting defects. The Focus layer, for instance, splits the input image into multiple lower-resolution feature maps through interval sampling, concatenating them along the channel dimension to preserve spatial information. This process can be represented mathematically as follows: given an input image tensor of dimensions $H \times W \times C$, the Focus layer produces an output tensor of dimensions $\frac{H}{2} \times \frac{W}{2} \times 4C$, effectively expanding the channels without losing details. This is crucial for identifying subtle metal casting defects that might otherwise be missed.
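To make the slicing operation concrete, the following is a minimal PyTorch sketch of a Focus-style layer. The trailing convolution, channel width, and activation are illustrative assumptions about one reasonable implementation, not the exact configuration of the modified backbone.

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """Slice the input into four interval-sampled sub-images, concatenate them
    along the channel axis, and fuse the result with a convolution (sketch)."""

    def __init__(self, in_channels: int, out_channels: int, k: int = 3):
        super().__init__()
        # Slicing turns C channels into 4C channels at half the spatial resolution.
        self.conv = nn.Sequential(
            nn.Conv2d(4 * in_channels, out_channels, k, stride=1, padding=k // 2, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.SiLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Interval sampling: every second pixel in each direction gives four
        # complementary H/2 x W/2 sub-images, so no pixel is discarded.
        patches = [
            x[..., ::2, ::2],    # even rows, even cols
            x[..., 1::2, ::2],   # odd rows, even cols
            x[..., ::2, 1::2],   # even rows, odd cols
            x[..., 1::2, 1::2],  # odd rows, odd cols
        ]
        return self.conv(torch.cat(patches, dim=1))

# A 1x3x640x640 input becomes 1x12x320x320 after slicing, then 1x64x320x320 after the conv:
# Focus(3, 64)(torch.randn(1, 3, 640, 640)).shape == torch.Size([1, 64, 320, 320])
```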
In the backbone network, I replace the standard convolutional layers with the Focus layer to enhance feature extraction. The operation involves splitting the input into four complementary sub-images, which are then concatenated. For example, if the input is a 608×608×3 image, the Focus layer outputs a 304×304×12 feature map. This reduces the spatial dimensions while increasing the channel depth, allowing the network to capture more contextual information. The modified backbone also includes the BoTNet module, which integrates multi-head self-attention (MHSA) to improve global feature representation. The self-attention mechanism computes query (Q), key (K), and value (V) vectors from the input features, enabling the model to focus on relevant regions for detecting metal casting defects. The attention score is calculated as:
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$
where $d_k$ is the dimension of the key vectors. By using multiple attention heads, the model can simultaneously attend to different aspects of the input, enhancing its ability to distinguish between various metal casting defects. Additionally, the SimSPPF module replaces the original SPPF to optimize pooling operations. SimSPPF employs ReLU activation instead of SiLU, reducing computational overhead while maintaining performance. The pooling process involves multiple max-pooling layers with kernel sizes of 5×5, applied sequentially to capture features at different scales. This is expressed as:
$$ \text{SimSPPF}(x) = \text{Concat}\left(x,\; y_1,\; y_2,\; y_3\right), \qquad y_1 = \text{MaxPool}_{5\times5}(x), \quad y_2 = \text{MaxPool}_{5\times5}(y_1), \quad y_3 = \text{MaxPool}_{5\times5}(y_2) $$
where $\text{Concat}$ denotes concatenation along the channel dimension and $y_1$, $y_2$, $y_3$ are the outputs of the successive pooling stages. This hierarchical pooling helps in handling metal casting defects of varying sizes, from large shrinkage cavities to fine cracks.
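As a concrete reference for the two modules just described, the following PyTorch sketch implements sequential 5×5 pooling with ReLU activations and the scaled dot-product attention from the formula above. The hidden-channel reduction, 1×1 fusing convolutions, and layer names are assumptions about one plausible implementation rather than the exact configuration used in the experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBNReLU(nn.Module):
    """1x1 conv + BN + ReLU; using ReLU instead of SiLU is what makes SimSPPF 'simplified'."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, 1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.ReLU(inplace=True)
    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class SimSPPF(nn.Module):
    """Sequential 5x5 max pooling; the input and the three pooled maps are concatenated and fused."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        c_hidden = c_in // 2
        self.reduce = ConvBNReLU(c_in, c_hidden)
        self.pool = nn.MaxPool2d(kernel_size=5, stride=1, padding=2)
        self.fuse = ConvBNReLU(4 * c_hidden, c_out)
    def forward(self, x):
        x = self.reduce(x)
        y1 = self.pool(x)    # effective receptive field ~5x5
        y2 = self.pool(y1)   # ~9x9
        y3 = self.pool(y2)   # ~13x13
        return self.fuse(torch.cat([x, y1, y2, y3], dim=1))

def scaled_dot_product_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, as used inside each MHSA head."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    return F.softmax(scores, dim=-1) @ v
```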
The overall network architecture consists of three main components: the backbone for feature extraction, the feature pyramid network (FPN) for multi-scale fusion, and the head for prediction. The backbone outputs feature maps at resolutions of 80×80, 40×40, and 20×20, which are then fused in the FPN to combine low-level and high-level features. The prediction head utilizes a decoupled structure to output bounding boxes, confidence scores, and class labels for the metal casting defects. To quantify the improvements, I conducted ablation studies comparing different configurations of the proposed modules. The results are summarized in the table below:
| Focus Layer | BoTNet Module | mAP (%) |
|---|---|---|
| No | No | 93.48 |
| Yes | No | 95.60 |
| No | Yes | 97.33 |
| Yes | Yes | 98.80 |
As shown, integrating both the Focus layer and BoTNet module yields the highest mAP, demonstrating their synergistic effect on improving detection accuracy for metal casting defects. Furthermore, I compared my method with other YOLO variants using the same dataset. The performance metrics, including mAP and FPS, are presented in the following table:
| Model | mAP (%) | FPS | Porosity (conf.) | Shrinkage (conf.) | Cracks (conf.) | Micro-porosity (conf.) |
|---|---|---|---|---|---|---|
| YOLOv5 | 92.79 | 73.40 | 0.50 | 0.69 | 0.84 | N/A |
| YOLOv7 | 91.27 | 76.15 | 0.64 | 0.55 | 0.67 | 0.69 |
| YOLOv8x | 93.48 | 80.60 | 0.80 | 0.48 | 0.78 | 0.84 |
| Proposed Method | 98.80 | 75.53 | 0.72 | 0.74 | 0.79 | 0.80 |
The values under defect types represent the average confidence scores for correct detections. My method achieves the highest mAP, indicating superior overall performance in identifying metal casting defects. Although the FPS is slightly lower than YOLOv8x, it remains sufficient for real-time industrial applications. The confidence scores for individual defect types are more balanced and higher in most cases, reducing the risk of missed detections. For instance, shrinkage defects, which are often challenging to detect due to their irregular shapes, show a significant improvement in confidence from 0.48 in YOLOv8x to 0.74 in my approach.
To further illustrate the detection process, consider a typical X-ray image of a wheel containing metal casting defects, where porosity and shrinkage tend to appear in regions with complex geometries. My algorithm processes such an image by first applying the Focus layer to slice the input and enrich its channel representation, then using the BoTNet module to assign attention weights to potential defect regions. The SimSPPF module aggregates multi-scale features so that both small and large metal casting defects are captured. The output consists of bounding boxes with class labels and confidence scores, enabling quick decision-making in quality control, as sketched below.
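To illustrate how the detector output could drive that decision, here is a small hedged sketch; the `Detection` container and the single global confidence threshold are hypothetical simplifications, and per-class thresholds could equally be used for defect types of different criticality.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """Hypothetical container for one prediction from the detection head."""
    label: str         # "porosity", "shrinkage", "cracks", or "micro-porosity"
    confidence: float  # confidence score in [0, 1]
    box: tuple         # (x1, y1, x2, y2) in pixels

def wheel_passes(detections: list[Detection], threshold: float = 0.5) -> bool:
    """Reject the wheel if any defect is detected above the confidence threshold."""
    return all(d.confidence < threshold for d in detections)
```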

In terms of mathematical formulation, the loss function used in training combines a classification loss, a localization loss, and a confidence (objectness) loss. The classification loss is based on cross-entropy, while the localization loss uses Complete IoU (CIoU) to improve bounding box accuracy. The total loss can be expressed as:
$$ L_{\text{total}} = \lambda_{\text{cls}} L_{\text{cls}} + \lambda_{\text{box}} L_{\text{box}} + \lambda_{\text{obj}} L_{\text{obj}} $$
where $L_{\text{cls}}$ is the classification loss, $L_{\text{box}}$ is the CIoU loss for bounding box regression, and $L_{\text{obj}}$ is the objectness loss. The coefficients $\lambda_{\text{cls}}$, $\lambda_{\text{box}}$, and $\lambda_{\text{obj}}$ are set to 0.5, 0.05, and 1.0, respectively, to balance the contributions. This loss function ensures that the model effectively learns to distinguish between different types of metal casting defects while accurately localizing them.
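As a minimal sketch of how the weighted sum could be assembled once the individual terms have been computed (the per-term implementations themselves are not reproduced here):

```python
import torch

# Weighting coefficients from the text.
LAMBDA_CLS, LAMBDA_BOX, LAMBDA_OBJ = 0.5, 0.05, 1.0

def total_loss(l_cls: torch.Tensor, l_box: torch.Tensor, l_obj: torch.Tensor) -> torch.Tensor:
    """Weighted sum of the classification, CIoU box-regression, and objectness losses
    for one batch, matching the expression for L_total above."""
    return LAMBDA_CLS * l_cls + LAMBDA_BOX * l_box + LAMBDA_OBJ * l_obj
```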
The dataset used in this study was augmented through techniques such as rotation, scaling, and flipping to increase diversity and prevent overfitting. This is essential for generalizing to the variety of metal casting defects encountered in real-world scenarios. The training process involved splitting the data into 75% for training and 25% for validation, with a batch size of 16. The learning rate scheduler employed cosine annealing to gradually reduce the learning rate, promoting convergence. The precision (P) and recall (R) are defined as:
$$ P = \frac{TP}{TP + FP} $$
$$ R = \frac{TP}{TP + FN} $$
where TP, FP, and FN denote true positives, false positives, and false negatives, respectively. The average precision (AP) for each defect class is computed as the area under the precision-recall curve, and mAP is the mean of AP values across all classes. My method achieves high precision and recall for all metal casting defects, as evidenced by the mAP of 98.80%.
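The following is a hedged sketch of how these metrics could be computed, assuming the matched detection counts and the precision-recall points (sorted by descending confidence) are already available; it uses the standard all-points area-under-curve estimate of AP rather than any specific library implementation.

```python
import numpy as np

def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """P = TP / (TP + FP), R = TP / (TP + FN), guarding against empty denominators."""
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    return p, r

def average_precision(precisions: np.ndarray, recalls: np.ndarray) -> float:
    """Area under the precision-recall curve (all-points interpolation).
    Inputs are the PR points obtained by sweeping the confidence threshold,
    ordered so that recall is non-decreasing."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([1.0], precisions, [0.0]))
    # Make precision monotonically non-increasing before integrating.
    p = np.maximum.accumulate(p[::-1])[::-1]
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))

# mAP is then the mean of the per-class AP values over the four defect classes.
```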
In conclusion, I have developed an optimized YOLOv8-based algorithm that significantly improves the detection of metal casting defects in aluminum alloy wheels. By incorporating a Focus layer, SimSPPF module, and BoTNet with multi-head self-attention, the model achieves high accuracy and real-time performance. The experimental results confirm that my approach outperforms existing methods in terms of mAP and provides balanced detection across various defect types. This makes it suitable for industrial applications where rapid and reliable inspection of metal casting defects is crucial. Future work could explore further optimizations, such as knowledge distillation or adversarial training, to enhance robustness against noisy or low-quality X-ray images. Ultimately, advancing automated defect detection systems will contribute to higher product quality and safety in the manufacturing of critical components like aluminum alloy wheels.
