In the manufacturing industry, casting parts play a critical role due to their widespread application in mechanical systems. However, during the production process, weld surface defects in casting parts, such as pores, cracks, slag inclusions, weld tumors, and lack of fusion, significantly compromise the structural integrity and service life of these components. Traditional inspection methods, which rely heavily on manual visual assessment or microscopic image analysis, are often labor-intensive, time-consuming, and prone to human error, making them unsuitable for modern industrial demands. With the advent of deep learning, computer vision techniques have emerged as powerful tools for automating defect detection. Among these, object detection algorithms like YOLOv3 have shown promise, but they face challenges when applied to casting part weld defects, including limited datasets, complex environmental conditions, low recognition accuracy for small targets, and high computational costs. To address these issues, I propose an enhanced YOLOv3 algorithm, termed YOLOv3-GN, which incorporates lightweight network architectures, advanced feature extraction modules, and optimized loss functions to improve detection performance specifically for casting part weld surfaces.
The core of my approach revolves around modifying the standard YOLOv3 framework to enhance its efficiency and accuracy. The original YOLOv3 utilizes Darknet-53 as its backbone network, which, while effective, involves a substantial number of parameters and computational overhead. For casting part inspection, where real-time processing is often required in resource-constrained environments, this can be a bottleneck. Therefore, I replace the backbone with GhostNet, a lightweight convolutional neural network designed to generate more features using cheap operations. GhostNet reduces redundancy by employing ghost modules that apply linear transformations to intrinsic feature maps, thereby decreasing parameters and computations without significant loss in representational capacity. The modified backbone network structure for YOLOv3-GN is summarized in Table 1, detailing the input sizes, operations, output channels, and strides.
| Index | Input | Operation | Output Channels | Stride |
|---|---|---|---|---|
| 1 | 640² × 3 | Conv2d 3×3 | 16 | 2 |
| 2 | 320² × 16 | G-bneck | 16 | 1 |
| 3 | 320² × 16 | G-bneck | 24 | 2 |
| 4 | 160² × 24 | G-bneck | 24 | 1 |
| 5 | 160² × 24 | G-bneck | 40 | 2 |
| 6 | 80² × 40 | G-bneck | 40 | 1 |
| 7 | 80² × 40 | G-bneck | 80 | 2 |
| 8 | 40² × 80 | G-bneck | 80 | 1 |
| 9 | 40² × 80 | G-bneck | 80 | 1 |
| 10 | 40² × 80 | G-bneck | 80 | 1 |
| 11 | 40² × 80 | G-bneck | 112 | 1 |
| 12 | 40² × 112 | G-bneck | 112 | 1 |
| 13 | 40² × 112 | G-bneck | 160 | 2 |
| 14 | 20² × 160 | G-bneck | 160 | 1 |
| 15 | 20² × 160 | G-bneck | 160 | 1 |
| 16 | 20² × 160 | G-bneck | 160 | 1 |
| 17 | 20² × 160 | G-bneck | 160 | 1 |
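
To make the parameter savings of a ghost module concrete, the following minimal sketch counts weights for an ordinary convolution and for a ghost module of the same input and output width. The ratio `s = 2`, the cheap-operation kernel `d = 3`, and the 80-to-80-channel example layer are illustrative assumptions, not values reported for YOLOv3-GN; bias and batch-norm parameters are ignored.

```python
def conv_params(c_in, c_out, k):
    """Weights in an ordinary k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def ghost_module_params(c_in, c_out, k=1, d=3, s=2):
    """Weights in a ghost module (bias and batch norm ignored).

    A primary k x k convolution produces c_out // s intrinsic maps; each
    intrinsic map then yields s - 1 extra "ghost" maps through a cheap
    d x d depthwise convolution.
    """
    intrinsic = c_out // s
    primary = conv_params(c_in, intrinsic, k)
    cheap = intrinsic * (s - 1) * d * d  # one d x d filter per ghost map
    return primary + cheap


# Hypothetical layer: 80 -> 80 channels with a 3x3 primary kernel.
ordinary = conv_params(80, 80, 3)
ghost = ghost_module_params(80, 80, k=3, d=3, s=2)
print(ordinary, ghost, round(ordinary / ghost, 2))
```

Under these assumptions the ghost module needs roughly half the weights of the ordinary convolution, which is where most of the backbone's parameter reduction would come from.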
In Table 1, G-bneck denotes the ghost bottleneck module, which stacks two ghost modules and inserts a depthwise convolution between them when the stride is 2. This design substantially reduces the number of model parameters while maintaining feature extraction capability, which is crucial for casting part images that may contain intricate weld details. The input image size is set to 640×640×3 to balance resolution and computational efficiency.
After the backbone, the feature maps are passed to a feature pyramid network (FPN) for multi-scale detection. To capture contextual information across scales, I add a spatial pyramid pooling structure at the output of the last backbone layer. Specifically, I adopt a modified version called LeakSPPF, which replaces the SiLU activation function with Leaky ReLU to reduce computational complexity. LeakSPPF applies max pooling at several kernel sizes (e.g., 5×5, 9×9, 13×13) and concatenates the results with the input, thereby enlarging the receptive field and improving robustness to the varying defect sizes found in casting parts. Given an input feature map \(X\), the concatenated output \(Y\) is:
$$Y = \text{Concat}\left(\text{MaxPool}_{k_1}(X), \text{MaxPool}_{k_2}(X), \text{MaxPool}_{k_3}(X), X\right)$$
where \(k_1, k_2, k_3\) denote different kernel sizes. This is followed by a convolutional layer to reduce channel dimensions, effectively fusing multi-scale features essential for detecting both large and small defects on casting part weld surfaces.
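The pooling-and-concatenation step can be sketched in plain Python on a single-channel grid. This is an illustration only: the stride-1 "same" padding, the 16×16 test grid, and the helper names are assumptions, and the trailing convolution and Leaky ReLU are omitted. The sketch also checks a known property of SPPF-style blocks, namely that chaining 5×5 poolings reproduces the 9×9 and 13×13 results; whether LeakSPPF pools in series or in parallel is not stated here.

```python
def max_pool_same(x, k):
    """Stride-1 max pooling over a 2-D grid with 'same' padding.

    Out-of-range positions are skipped, which matches padding with -inf.
    """
    h, w, r = len(x), len(x[0]), k // 2
    return [
        [
            max(
                x[i][j]
                for i in range(max(0, row - r), min(h, row + r + 1))
                for j in range(max(0, col - r), min(w, col + r + 1))
            )
            for col in range(w)
        ]
        for row in range(h)
    ]


def spp_concat(x, kernels=(5, 9, 13)):
    """Return [MaxPool_k1(X), MaxPool_k2(X), MaxPool_k3(X), X] as a channel list."""
    return [max_pool_same(x, k) for k in kernels] + [x]


# Hypothetical 16 x 16 single-channel feature map.
x = [[(7 * i + 13 * j) % 17 for j in range(16)] for i in range(16)]
y = spp_concat(x)

# Serial 5x5 pooling covers the same windows as parallel 9x9 and 13x13 pooling.
serial_9 = max_pool_same(max_pool_same(x, 5), 5)
serial_13 = max_pool_same(serial_9, 5)
print(len(y), serial_9 == y[1], serial_13 == y[2])
```

Each input channel yields four output channels, which is why a convolution follows to bring the channel count back down.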
To address the challenge of small defect detection, such as pores in casting parts, I incorporate attention mechanisms into the FPN. The Squeeze-and-Excitation (SE) module is used to recalibrate channel-wise feature responses by modeling interdependencies between channels. This allows the network to focus more on informative features and suppress less useful ones. The SE operation involves global average pooling to squeeze spatial information, followed by two fully connected layers with activation functions to generate channel weights. Mathematically, for an input feature map \(U \in \mathbb{R}^{H \times W \times C}\), the squeeze operation computes channel statistics \(z_c\):
$$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} U_c(i, j)$$
Then, the excitation operation produces weights \(s_c\):
$$s = \sigma(W_2 \delta(W_1 z))$$
where \(\sigma\) denotes the sigmoid function, \(\delta\) is the ReLU activation, and \(W_1, W_2\) are weights of the fully connected layers. The final output is obtained by scaling the input features: \(\tilde{U}_c = s_c \cdot U_c\). Additionally, to prevent dimension loss during feature fusion, I insert 1×1 convolutions before the SE modules in the FPN, expanding channel numbers and preserving critical information for casting part defect characterization.
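The squeeze, excitation, and scaling steps can be traced with a small pure-Python sketch. The tensor size, the reduction ratio `r`, and the random weights are hypothetical and only serve to show the data flow; in a trained network \(W_1\) and \(W_2\) are learned.

```python
import math
import random


def se_block(u, w1, w2):
    """Apply squeeze-and-excitation to u, a list of C channels (each H x W)."""
    # Squeeze: global average pooling gives one statistic z_c per channel.
    z = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in u]
    # Excitation: s = sigmoid(W2 . relu(W1 . z)).
    hidden = [max(0.0, sum(w * zc for w, zc in zip(row, z))) for row in w1]
    s = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden)))) for row in w2]
    # Scale: multiply every value in channel c by its weight s_c.
    scaled = [[[s_c * v for v in row] for row in ch] for ch, s_c in zip(u, s)]
    return scaled, s


random.seed(0)
C, H, W, r = 4, 3, 3, 2  # r is a hypothetical reduction ratio
u = [[[random.random() for _ in range(W)] for _ in range(H)] for _ in range(C)]
w1 = [[random.uniform(-1, 1) for _ in range(C)] for _ in range(C // r)]  # (C/r) x C
w2 = [[random.uniform(-1, 1) for _ in range(C // r)] for _ in range(C)]  # C x (C/r)
out, s = se_block(u, w1, w2)
print([round(v, 3) for v in s])
```

Because the sigmoid keeps every \(s_c\) strictly between 0 and 1, the block can only attenuate channels relative to one another; it never changes the spatial layout of a feature map.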
Another key improvement is the adoption of Focal Loss to handle class imbalance, which is common in casting part weld defect datasets where some defect types are underrepresented. Standard cross-entropy loss can be dominated by the many easy-to-classify samples, leading to poor performance on rare defect categories. Focal Loss adds a modulating factor that down-weights easy examples so that training concentrates on hard, misclassified ones. The loss function is defined as:
$$FL(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)$$
Here, \(p_t\) represents the model’s estimated probability for the true class, \(\alpha_t\) is a balancing factor for class frequencies, and \(\gamma\) is a tunable focusing parameter. By setting \(\gamma > 0\), the loss for well-classified samples is reduced, emphasizing challenging cases like small pores or fine cracks in casting parts. This adjustment enhances the model’s sensitivity to positive samples, improving overall detection accuracy.
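A short numeric sketch shows how the modulating factor behaves. The defaults `alpha_t = 0.25` and `gamma = 2.0` are commonly used values and are assumptions here; the values used to train YOLOv3-GN are not stated.

```python
import math


def focal_loss(p_t, alpha_t=0.25, gamma=2.0):
    """FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t) for one sample."""
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def cross_entropy(p_t):
    """Standard cross-entropy for the true class: -log(p_t)."""
    return -math.log(p_t)


# With alpha_t = 1 the ratio to cross-entropy is exactly (1 - p_t)**gamma.
for p_t in (0.9, 0.5, 0.1):
    ratio = focal_loss(p_t, alpha_t=1.0) / cross_entropy(p_t)
    print(f"p_t={p_t}: loss scaled by {ratio:.2f} relative to cross-entropy")
```

With \(\gamma = 2\), a well-classified sample at \(p_t = 0.9\) keeps only about 1% of its cross-entropy loss, while a hard sample at \(p_t = 0.1\) keeps about 81%.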
The overall architecture of YOLOv3-GN is depicted in a structural diagram, which illustrates the flow from input casting part images through the GhostNet backbone, LeakSPPF module, enhanced FPN with SE attention, and finally to the detection heads that output bounding boxes and class predictions. The model outputs three feature maps at different scales (80×80, 40×40, 20×20) to detect objects of varying sizes, ensuring comprehensive coverage of weld defects on casting parts.
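The three output resolutions follow from the 640×640 input and downsampling factors of 8, 16, and 32. The sketch below computes the head shapes; three anchors per grid cell is carried over from the original YOLOv3 and is an assumption for YOLOv3-GN, as is the function name.

```python
def head_shapes(input_size=640, strides=(8, 16, 32), num_classes=5, anchors_per_cell=3):
    """Return (grid, grid, channels) for each detection head."""
    # Each anchor predicts 4 box offsets, 1 objectness score, and the class scores.
    channels = anchors_per_cell * (4 + 1 + num_classes)
    return [(input_size // s, input_size // s, channels) for s in strides]


shapes = head_shapes()
# Total candidate boxes before non-maximum suppression (3 anchors per cell).
total_boxes = sum(g * g * 3 for g, _, _ in shapes)
print(shapes, total_boxes)
```

The 80×80 map has the finest grid and is therefore the one best placed to localize small defects such as pores.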

To validate the proposed method, I conducted extensive experiments using a custom dataset of casting part weld surfaces. The dataset initially comprised 861 images with five defect categories: weld tumor, slag inclusion, crack, pore, and lack of fusion. Each image was resized to 800×800 pixels to reduce memory usage. To augment the dataset and improve model generalization, I applied various data augmentation techniques, including random rotation, horizontal translation, brightness adjustment, and noise addition. After augmentation, the dataset expanded to 4305 images, which were split into training, validation, and test sets in an 8:1:1 ratio. This augmentation process is crucial for simulating real-world variations in casting part inspection environments, such as lighting changes and surface irregularities.
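The dataset arithmetic and the 8:1:1 split can be reproduced with a few lines. The shuffling seed and the rounding rule (remainder assigned to the test set) are assumptions; the factor of five is inferred from the reported totals of 861 and 4305 images.

```python
import random


def split_indices(n, ratios=(8, 1, 1), seed=0):
    """Shuffle range(n) and cut it into train/val/test by the given ratios."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    total = sum(ratios)
    n_train = n * ratios[0] // total
    n_val = n * ratios[1] // total
    # Whatever is left after train and val goes to the test set.
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


n_images = 861 * 5  # 861 originals; 4305 images after augmentation
train, val, test = split_indices(n_images)
print(len(train), len(val), len(test))
```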
The experiments ran on a Linux system with an Intel Xeon Gold 6246C CPU and four NVIDIA Tesla P40 GPUs, each with 24 GB of memory. The software stack included Python 3.7, CUDA 10.1 for GPU acceleration, and deep learning frameworks such as PyTorch. The model was trained from scratch for 300 epochs with an initial learning rate of 0.01, a momentum of 0.937, and a batch size of 32. I used the stochastic gradient descent (SGD) optimizer with a cosine annealing scheduler, which gradually lowers the learning rate over the course of training. These settings were chosen to promote stable convergence and effective learning of casting part defect features.
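As a rough illustration of the schedule, the sketch below evaluates a cosine annealing curve for the reported 300 epochs and initial rate of 0.01. The minimum rate of 0 and the absence of warm-up or restarts are assumptions, since the exact scheduler settings are not given.

```python
import math


def cosine_lr(epoch, total_epochs=300, lr_init=0.01, lr_min=0.0):
    """Learning rate at a given epoch under cosine annealing."""
    cos_term = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_init - lr_min) * cos_term


for epoch in (0, 75, 150, 225, 300):
    print(epoch, round(cosine_lr(epoch), 5))
```

The rate starts at 0.01, passes through half that value at the midpoint, and decays smoothly toward the minimum by the final epoch.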
Evaluation metrics included average precision (AP), mean average precision (mAP), model parameters (MP), and frames per second (FPS). AP is computed as the area under the precision-recall curve for each defect class, while mAP averages AP across all classes. For casting part applications, both accuracy and efficiency are critical, so MP and FPS measure model complexity and detection speed, respectively. The formulas are as follows:
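$$AP = \int_0^1 P(R) \, dR \approx \sum_{n=1}^N P(n) \, \Delta R(n)$$
$$mAP = \frac{1}{m} \sum_{j=1}^m AP(j)$$
where \(P\) denotes precision, \(R\) is recall, \(N\) is the number of sampled points on the precision-recall curve, \(\Delta R(n)\) is the change in recall between points \(n-1\) and \(n\), and \(m\) is the number of defect classes. The sum is a discrete approximation of the integral over the continuous precision-recall curve.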
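The discrete form of AP and the averaging step can be written out directly. The precision-recall points below are hypothetical, and the sketch uses the plain rectangle sum without the precision interpolation that some evaluation toolkits apply.

```python
def average_precision(precisions, recalls):
    """Approximate AP as the sum of P(n) times the recall increment at point n.

    Recall starts from 0 and the points must be ordered by increasing recall.
    """
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += p * (r - prev_recall)
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class):
    """Average the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


# Hypothetical precision-recall points for one defect class.
precisions = [1.0, 0.8, 0.6]
recalls = [0.2, 0.6, 1.0]
ap = average_precision(precisions, recalls)
print(round(ap, 2))  # 1.0*0.2 + 0.8*0.4 + 0.6*0.4
```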
I compared YOLOv3-GN against several baseline models, namely the original YOLOv3, SSD, and Faster R-CNN. The results are summarized in Table 2. YOLOv3-GN achieves an mAP of 90.97%, which is 1.55 percentage points higher than YOLOv3, 26.43 points higher than SSD, and 10.44 points higher than Faster R-CNN. For small targets such as pores in casting parts, AP improves by 4 percentage points over YOLOv3 (75% versus 71%). The parameter count falls to 31.28 M, roughly half of YOLOv3’s 59.44 M, and inference speed rises from 12.68 to 17.46 FPS. SSD remains smaller (24.14 M) and faster (29.34 FPS), but its mAP is far lower, so YOLOv3-GN offers the better balance of accuracy and efficiency for casting part inspection.
| Model | mAP (%) | AP for Pore (%) | MP (M) | FPS (frames/s) |
|---|---|---|---|---|
| YOLOv3 | 89.42 | 71 | 59.44 | 12.68 |
| SSD | 64.54 | 32 | 24.14 | 29.34 |
| Faster R-CNN | 80.53 | 53 | 136.77 | 12.50 |
| YOLOv3-GN (Proposed) | 90.97 | 75 | 31.28 | 17.46 |
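
The comparisons drawn from Table 2 can be rechecked with a few lines of arithmetic. The numbers are copied from the table; only the variable names are mine.

```python
# (mAP %, parameters in millions, FPS) copied from Table 2.
results = {
    "YOLOv3": (89.42, 59.44, 12.68),
    "SSD": (64.54, 24.14, 29.34),
    "Faster R-CNN": (80.53, 136.77, 12.50),
    "YOLOv3-GN": (90.97, 31.28, 17.46),
}

ours_map, ours_params, ours_fps = results["YOLOv3-GN"]
# mAP gain of YOLOv3-GN over each baseline, in percentage points.
map_gain = {
    name: round(ours_map - m, 2)
    for name, (m, _, _) in results.items()
    if name != "YOLOv3-GN"
}
param_ratio = round(ours_params / results["YOLOv3"][1], 3)
speedup = round(ours_fps / results["YOLOv3"][2], 2)
print(map_gain, param_ratio, speedup)
```

Relative to YOLOv3, the proposed model keeps about 53% of the parameters and runs about 1.38 times faster.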
To analyze the contribution of each component, I conducted ablation studies by adding modifications, individually and in combination, to the base model (YOLOv3 with the GhostNet backbone). The results are presented in Table 3. Experiment 1 is the baseline with only GhostNet. Adding LeakSPPF alone (Experiment 2) raises mAP by 1.30 percentage points, demonstrating the benefit of multi-scale feature fusion for casting part defects. The SE attention mechanism alone (Experiment 3) adds 0.71 points, highlighting its role in emphasizing relevant features. Focal Loss alone (Experiment 4) adds 1.54 points, indicating better handling of class imbalance. The 1×1 convolutions alone (Experiment 5) contribute 1.16 points by preserving dimensional information. When all components are combined (Experiment 7), mAP reaches 90.97%, a 3.33-point improvement over the baseline, while FPS drops from 19.88 to 17.46 because of the added complexity. This trade-off is acceptable given the accuracy gain for casting part weld inspection.
| Experiment | LeakSPPF | SE | Focal Loss | 1×1 Convolution | mAP (%) | FPS (frames/s) |
|---|---|---|---|---|---|---|
| 1 | | | | | 87.64 | 19.88 |
| 2 | √ | | | | 88.94 | 17.78 |
| 3 | | √ | | | 88.35 | 18.09 |
| 4 | | | √ | | 89.18 | 19.23 |
| 5 | | | | √ | 88.80 | 18.81 |
| 6 | √ | √ | √ | | 89.16 | 17.67 |
| 7 | √ | √ | √ | √ | 90.97 | 17.46 |
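
A quick calculation on the ablation numbers shows that the individual gains do not simply add up. The mAP values are copied from Experiments 1 to 5 and 7; the dictionary keys are just labels.

```python
baseline = 87.64  # Experiment 1: GhostNet backbone only
single = {"LeakSPPF": 88.94, "SE": 88.35, "Focal Loss": 89.18, "1x1 conv": 88.80}
combined = 90.97  # Experiment 7: all four components

# Gain of each component on its own, in percentage points.
gains = {name: round(m - baseline, 2) for name, m in single.items()}
sum_of_gains = round(sum(gains.values()), 2)
combined_gain = round(combined - baseline, 2)
print(gains, sum_of_gains, combined_gain)
```

The four single-component gains sum to 4.71 points, whereas the full model gains 3.33 points, which suggests that the components partly overlap in what they improve.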
Visual inspection of detection results on unseen casting part images further confirms the effectiveness of YOLOv3-GN. Compared to the original YOLOv3, the proposed model produces higher confidence scores for defects and reduces false positives and missed detections. For instance, in images with multiple weld tumors, YOLOv3-GN avoids duplicate predictions and accurately localizes all instances. Similarly, for small pores and fine cracks, the attention mechanisms and multi-scale features enable better recognition, whereas the baseline model may overlook these subtle defects. This robustness is essential for ensuring the quality and safety of casting parts in industrial applications.
The lightweight nature of YOLOv3-GN, achieved through GhostNet and optimized modules, makes it suitable for deployment in edge computing devices or real-time monitoring systems. In casting part manufacturing lines, rapid defect detection can prevent faulty products from progressing, reducing waste and maintenance costs. The model’s ability to handle complex backgrounds and varying lighting conditions, thanks to data augmentation and enhanced feature extraction, aligns well with the unpredictable environments often encountered in foundries and welding shops.
In conclusion, the improved YOLOv3 algorithm, YOLOv3-GN, offers a compelling solution for casting part weld surface defect detection. By integrating GhostNet for a lightweight backbone, LeakSPPF for multi-scale context aggregation, SE attention for feature recalibration, and Focal Loss for class imbalance mitigation, the model balances accuracy and efficiency. On a custom casting part dataset, it achieves the highest mAP among the compared models while using about half the parameters of YOLOv3 and running faster than both YOLOv3 and Faster R-CNN. Future work could explore further optimizations, such as neural architecture search for custom lightweight networks or generative adversarial networks for synthetic data generation to expand the variety of casting part defects. Overall, this approach contributes to automated quality inspection in the casting industry and supports the reliability and durability of critical mechanical components.
