Efficient Defect Detection in Casting Components Using an Enhanced Lightweight Neural Network

In modern industrial manufacturing, the quality inspection of metal castings, particularly large and critical components like railway couplers, remains a significant challenge. Traditional manual inspection for casting defects is labor-intensive, time-consuming, and prone to human error and inconsistency. The automation of this process is therefore crucial for improving production efficiency and ensuring product reliability. Computer vision, especially deep learning-based object detection, offers a promising solution. However, deploying such models in real industrial environments presents unique challenges: the models must be fast enough for real-time processing on potentially limited hardware, accurate enough to identify subtle and varied defects, and robust enough to handle challenging conditions like variable lighting and complex part geometries.

The target in this work is the automated detection of surface defects on large wagon coupler castings. Common casting defects in such components include pores, sand inclusions, cold shuts, and cracks. Detecting these casting defects from 2D images is difficult because the flaws often exhibit low color contrast with the base metal, have irregular and variable shapes, and can appear in cluttered backgrounds. This necessitates a detection algorithm with strong feature extraction capabilities. While state-of-the-art detectors like YOLOv8 offer high accuracy, their computational complexity can be a bottleneck for real-time applications. This paper proposes a novel defect detection method that balances speed and accuracy by enhancing the YOLOv8n architecture. The core innovations involve replacing the original backbone with the lightweight MobileNetV3 network to drastically reduce computational load and integrating a novel attention module, SA-BRA, to recover and enhance the model’s feature discrimination power for pinpointing casting defects.

The proposed system is designed for integration into an automated cleaning or processing station. A camera captures images of the coupler casting, and the algorithm must quickly and accurately locate any surface casting defects, outputting their bounding boxes and classes. This information can then guide robotic tools for grinding, welding, or marking. The requirement for near-instantaneous feedback makes inference speed (Frames Per Second, FPS) a critical metric alongside detection precision. The primary contribution of this work is a refined YOLOv8n model that achieves this balance through strategic architectural modifications, validated on a dedicated dataset of coupler casting defect images.

1. Methodology

1.1 Base Detector: YOLOv8n Architecture

The YOLOv8 family represents the latest evolution of the “You Only Look Once” single-stage detection paradigm, known for its excellent balance of speed and accuracy. For industrial applications where computational resources may be constrained, the ‘n’ (nano) version, YOLOv8n, is an appropriate starting point due to its smaller parameter count. Its architecture consists of four main parts:

  • Input: Employs adaptive image scaling and Mosaic data augmentation during training to improve robustness.
  • Backbone: The feature extractor, built primarily with a modified CSPDarknet structure featuring C2f modules. The C2f module, inspired by ELAN, uses a split and concatenate strategy with multiple bottleneck blocks to enrich gradient flow and capture features at different scales, which is beneficial for multi-scale casting defects.
  • Neck: A Feature Pyramid Network (FPN) combined with a Path Aggregation Network (PAN). This structure facilitates bi-directional feature fusion, combining semantically rich, high-level features from deeper layers with spatially precise, low-level features from shallower layers. This is crucial for detecting small casting defects that retain fine spatial details.
  • Head: A decoupled head structure that separates the tasks of classification and bounding box regression, leading to more focused and effective learning for each task. It uses an Anchor-Free approach, predicting object centers directly, which simplifies the design and can improve generalization.
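To make the C2f structure concrete, the sketch below is a minimal PyTorch rendering of its split-and-concatenate idea. Batch normalization, SiLU activations, and the exact channel bookkeeping of the official Ultralytics implementation are omitted for brevity, and all class and argument names are illustrative:

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Two 3x3 convolutions with an optional residual shortcut."""
    def __init__(self, c, shortcut=True):
        super().__init__()
        self.cv1 = nn.Conv2d(c, c, 3, padding=1, bias=False)
        self.cv2 = nn.Conv2d(c, c, 3, padding=1, bias=False)
        self.add = shortcut

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y

class C2f(nn.Module):
    """Split-and-concatenate block: the input is projected and split in two;
    one half flows through a chain of bottlenecks, and every intermediate
    output is kept and concatenated, enriching gradient flow."""
    def __init__(self, c_in, c_out, n=2):
        super().__init__()
        self.c = c_out // 2
        self.cv1 = nn.Conv2d(c_in, 2 * self.c, 1, bias=False)
        self.m = nn.ModuleList(Bottleneck(self.c) for _ in range(n))
        self.cv2 = nn.Conv2d((2 + n) * self.c, c_out, 1, bias=False)

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))   # split into two halves
        for m in self.m:
            y.append(m(y[-1]))                   # keep each bottleneck output
        return self.cv2(torch.cat(y, dim=1))     # fuse all branches
```

Keeping every intermediate bottleneck output in the concatenation is what distinguishes C2f from a plain residual stack and gives it its multi-scale, ELAN-like gradient paths.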

While YOLOv8n is efficient, its backbone is still relatively heavy for edge deployment. Directly using it for real-time casting defect detection on high-resolution images may not meet stringent speed requirements. Therefore, a lightweight redesign is necessary.

1.2 Lightweight Backbone Replacement with MobileNetV3

To achieve real-time performance, the original YOLOv8n backbone is replaced with MobileNetV3, a highly optimized lightweight network. The key to MobileNet’s efficiency is the depthwise separable convolution. This operation decomposes a standard convolution into two steps, dramatically reducing parameters and computations.

A standard convolution with kernel size $P_K$, input channels $Q_{in}$, output channels $Q_{out}$, and feature map size $P_F$ has a computational cost of:

$$ \text{FLOPs}_{\text{standard}} = P_K^2 \times Q_{in} \times Q_{out} \times P_F^2 $$

In contrast, depthwise separable convolution first applies a depthwise convolution, where a single filter is applied per input channel:

$$ \text{FLOPs}_{\text{depthwise}} = P_K^2 \times Q_{in} \times P_F^2 $$

This is followed by a pointwise convolution (a 1×1 convolution) to combine channels:

$$ \text{FLOPs}_{\text{pointwise}} = 1^2 \times Q_{in} \times Q_{out} \times P_F^2 $$

The total cost for depthwise separable convolution is:

$$ \text{FLOPs}_{\text{separable}} = P_K^2 \times Q_{in} \times P_F^2 + Q_{in} \times Q_{out} \times P_F^2 $$

The ratio of computations shows the efficiency gain:

$$ \frac{\text{FLOPs}_{\text{separable}}}{\text{FLOPs}_{\text{standard}}} = \frac{1}{Q_{out}} + \frac{1}{P_K^2} $$

For a typical 3×3 convolution ($P_K=3$) and a reasonable number of output channels (e.g., $Q_{out}=256$), the depthwise separable convolution uses roughly 8 to 9 times fewer computations. MobileNetV3 builds upon this with inverted residual blocks, squeeze-and-excitation attention for channel-wise feature recalibration, and the efficient h-swish activation function. Replacing the YOLOv8n backbone with MobileNetV3 significantly reduces the model’s parameter count and floating-point operations, directly translating to higher inference speed (FPS).
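The two-step factorization and the resulting cost ratio can be sketched in a few lines of PyTorch. The module below is a minimal illustration of a depthwise separable convolution (no batch normalization or activation, illustrative names), not the MobileNetV3 block itself:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise convolution (groups == channels, one filter per input
    channel) followed by a 1x1 pointwise convolution that mixes channels."""
    def __init__(self, q_in, q_out, p_k=3):
        super().__init__()
        self.depthwise = nn.Conv2d(q_in, q_in, p_k, padding=p_k // 2,
                                   groups=q_in, bias=False)
        self.pointwise = nn.Conv2d(q_in, q_out, 1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

def flops_ratio(p_k, q_out):
    """Separable/standard cost ratio: 1/Q_out + 1/P_K^2."""
    return 1 / q_out + 1 / p_k ** 2

# For a 3x3 kernel and 256 output channels, the separable form needs
# only ~11.5% of the standard cost, i.e. roughly an 8.7x reduction.
speedup = 1 / flops_ratio(3, 256)
```

Note that the ratio depends only on the kernel size and output channel count, which is why the "8 to 9 times" saving holds across feature map sizes.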

1.3 Enhanced Feature Attention with the SA-BRA Module

While MobileNetV3 boosts speed, it can lead to a drop in accuracy due to reduced model capacity, which is detrimental for identifying subtle casting defects. To compensate, we introduce a novel Spatial-Aware Bi-level Routing Attention (SA-BRA) module after the lightweight backbone. This module dynamically focuses the network’s attention on the most relevant spatial regions for casting defect detection.

The SA-BRA module combines two powerful mechanisms. The first is the Bi-level Routing Attention (BRA) mechanism. BRA is designed to handle the sparsity of attention in vision tasks efficiently. Instead of applying global attention (which is computationally expensive) or using fixed local windows (which limits the receptive field), BRA dynamically selects relevant regions in a two-step process:

  1. Region-to-Region Routing: The feature map is partitioned into $S \times S$ non-overlapping regions. For each region, average queries $\mathbf{Q}_r$ and keys $\mathbf{K}_r$ are computed. An adjacency matrix $\mathbf{D}_r$ measuring inter-region relevance is calculated: $\mathbf{D}_r = \mathbf{Q}_r \mathbf{K}_r^T$. A top-k operation selects the indices $\mathbf{H}_r$ of the $k$ most relevant regions for each query region.
  2. Token-to-Token Attention: Only the key $\mathbf{K}_g$ and value $\mathbf{V}_g$ tokens from the top-k relevant regions are gathered. Attention is computed within this sparse yet highly relevant set: $\mathbf{C}_{out} = \text{Attention}(\mathbf{Q}, \mathbf{K}_g, \mathbf{V}_g)$.

This process allows the model to gather information from semantically related areas far apart in the image, which is useful for large castings where a casting defect might have contextual cues elsewhere.
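The two-step routing described above can be sketched as follows. This is a deliberately simplified, single-head toy version for intuition only: the learned Q/K/V projections, multi-head handling, and other refinements of the published BRA design are omitted, and the input is assumed to be channels-last with spatial sides divisible by $S$:

```python
import torch

def bi_level_routing_attention(x, s=4, topk=2):
    """Simplified single-head sketch of Bi-level Routing Attention.
    x: (B, H, W, C) feature map, H and W divisible by s.
    Step 1: region-to-region routing via region-averaged queries/keys.
    Step 2: token-to-token attention restricted to the top-k routed regions.
    (The Q/K/V projections are omitted; x serves as query, key, and value.)
    """
    B, H, W, C = x.shape
    hr, wr = H // s, W // s                     # tokens per region side
    n = s * s                                    # number of regions
    # partition into (B, n, tokens_per_region, C)
    regions = (x.view(B, s, hr, s, wr, C)
                .permute(0, 1, 3, 2, 4, 5)
                .reshape(B, n, hr * wr, C))
    q_r = regions.mean(dim=2)                    # region-level queries
    k_r = regions.mean(dim=2)                    # region-level keys
    d_r = q_r @ k_r.transpose(1, 2)              # region adjacency (B, n, n)
    idx = d_r.topk(topk, dim=-1).indices         # top-k routed regions
    # gather key/value tokens only from the routed regions
    gather_idx = idx[..., None, None].expand(-1, -1, -1, hr * wr, C)
    kv = torch.gather(regions[:, None].expand(-1, n, -1, -1, -1), 2, gather_idx)
    kv = kv.reshape(B, n, topk * hr * wr, C)     # sparse K/V set per region
    # token-to-token attention within the sparse but relevant set
    attn = torch.softmax(regions @ kv.transpose(2, 3) / C ** 0.5, dim=-1)
    out = attn @ kv                              # (B, n, tokens, C)
    # fold regions back into the spatial layout
    return (out.view(B, s, s, hr, wr, C)
               .permute(0, 1, 3, 2, 4, 5)
               .reshape(B, H, W, C))
```

The key saving is that the quadratic token-to-token attention runs over only $k$ routed regions per query region instead of the whole image.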

The second mechanism is a Spatial Attention Module (SAM). After BRA highlights content-relevant regions, SAM further refines the spatial weighting. It operates by applying both max-pooling and average-pooling along the channel dimension to generate two 2D spatial feature maps. These maps are concatenated and processed by a small convolutional layer and a sigmoid activation to produce a final spatial attention map $\mathbf{M}_s$. This map is element-wise multiplied with the BRA-enhanced features:

$$ \mathbf{F}_{final} = \mathbf{F}_{BRA} \otimes \mathbf{M}_s $$

This sequential application of BRA and SAM—forming the SA-BRA module—ensures that the network first identifies globally relevant regions for a potential casting defect and then finely tunes the spatial focus within those regions, leading to more precise localization of defect boundaries and subtle features.
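The spatial attention step admits a short PyTorch sketch, assuming a CBAM-style design with a single 7×7 convolution over the pooled maps (the kernel size and names are assumptions for illustration, not taken from the paper):

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Channel-wise max- and average-pooling produce two 2D maps that a
    small convolution fuses into a sigmoid spatial mask M_s."""
    def __init__(self, k=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, k, padding=k // 2, bias=False)

    def forward(self, f_bra):
        avg_map = f_bra.mean(dim=1, keepdim=True)   # (B, 1, H, W)
        max_map = f_bra.amax(dim=1, keepdim=True)   # (B, 1, H, W)
        m_s = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return f_bra * m_s                           # F_final = F_BRA (x) M_s
```

Because the mask values lie in (0, 1), the module can only re-weight, never amplify, the BRA-enhanced features, concentrating the response on the spatial positions most indicative of a defect.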

2. Experimental Setup and Evaluation Metrics

2.1 Dataset Preparation

As there is no publicly available dataset for railway coupler casting defects, a proprietary dataset was constructed. Images of coupler castings were captured in an industrial setting under varying lighting conditions to ensure realism. Four common defect classes were annotated by experts. The dataset comprises 3,276 JPG images, each with a corresponding TXT file containing bounding box and class label information. The data was split into training (70%, 2,293 images), validation (20%, 655 images), and test (10%, 328 images) sets to facilitate model development and unbiased evaluation.

2.2 Implementation Details

Training and evaluation were conducted on a hardware platform with an Intel i5-7300HQ CPU and an NVIDIA GTX 1050 GPU. The software environment included Python 3.8 and PyTorch 1.12.1. The input image size was fixed at 640×640 pixels. The model was trained for 300 epochs using the Stochastic Gradient Descent (SGD) optimizer with an initial learning rate of 0.01, momentum of 0.937, and weight decay of 0.0005. Due to GPU memory constraints, the batch size was set to 2. Data augmentation techniques including mosaic, mixup, and random affine transformations were employed to improve model generalization.

2.3 Evaluation Metrics

The performance of the casting defect detection models was assessed using the following standard metrics:

  • Precision (P): The fraction of correctly identified defects among all predicted defects. High precision indicates low false alarm rate.
    $$ P = \frac{TP}{TP + FP} $$
  • Recall (R): The fraction of actual defects that were correctly detected. High recall indicates low miss rate.
    $$ R = \frac{TP}{TP + FN} $$
  • Average Precision (AP): The area under the Precision-Recall curve for a single class. It summarizes the trade-off between P and R.
  • mean Average Precision (mAP): The average of AP over all defect classes. We report mAP@0.5 (IoU threshold = 0.5) and mAP@0.5:0.95 (average mAP for IoU thresholds from 0.5 to 0.95 in steps of 0.05).
    $$ mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i $$
    where $N$ is the number of defect classes.
  • Parameters (Params): The total number of trainable parameters in the model, indicating its size and complexity.
  • Frames Per Second (FPS): The number of images the model can process per second on the test hardware, indicating its inference speed and suitability for real-time casting defect detection.

Here, $TP$ (True Positive) is the number of correctly detected defects, $FP$ (False Positive) is the number of background regions incorrectly reported as defects, and $FN$ (False Negative) is the number of missed defects.
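For reference, the count-based metrics above reduce to a few lines of Python; the detection counts in the example are hypothetical:

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from detection counts (guarding against
    empty denominators)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

def mean_ap(ap_per_class):
    """mAP: the mean of per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)

# Hypothetical counts: 90 correct detections, 5 false alarms, 10 misses.
p, r = precision_recall(tp=90, fp=5, fn=10)
# p ~ 0.947 (low false alarm rate), r = 0.9 (low miss rate)
```

AP itself requires the full Precision-Recall curve (sweeping the confidence threshold), so it is not shown here; in practice it is computed by the evaluation toolkit.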

3. Results and Analysis

3.1 Ablation Study

Ablation studies were conducted to validate the contribution of each proposed component. The baseline is the original YOLOv8n model. The results, presented in the table below, demonstrate the impact of sequentially adding the MobileNetV3 backbone and the SA-BRA module.

| Model | MobileNetV3 | SA-BRA | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Precision (%) | Recall (%) | Params (M) | FPS |
|---|:---:|:---:|---|---|---|---|---|---|
| YOLOv8n (Baseline) | – | – | 94.5 | 79.4 | 96.7 | 92.2 | 3.0 | 34.9 |
| Model A | ✓ | – | 90.4 | 74.6 | 92.4 | 89.1 | 1.2 | 65.3 |
| Model B | – | ✓ | 94.7 | 80.9 | 97.1 | 94.5 | 3.0 | 34.5 |
| Proposed Model | ✓ | ✓ | 93.7 | 77.3 | 94.3 | 91.5 | 1.2 | 61.8 |

Analysis: Replacing the backbone with MobileNetV3 (Model A) achieves the primary goal of lightweighting, reducing parameters by 60% (from 3.0M to 1.2M) and nearly doubling the FPS (from 34.9 to 65.3). However, this speed comes at a cost: mAP@0.5 drops by 4.1 percentage points, a significant loss in casting defect detection accuracy. Adding only the SA-BRA module to the original backbone (Model B) shows that the attention mechanism by itself slightly improves accuracy (mAP@0.5 rises by 0.2 percentage points) without significantly affecting speed, confirming its feature-enhancing capability. Finally, the proposed model integrates both components. Compared to the lightweight Model A, it recovers 3.3 of the 4.1 percentage points of mAP@0.5 lost to lightweighting (rising from 90.4% to 93.7%) while keeping the highly efficient 1.2M parameter count and an exceptional 61.8 FPS. This represents a superior trade-off: only about 3.5 FPS is sacrificed relative to Model A in exchange for a substantial gain in detection accuracy for casting defects. Compared to the baseline, the proposed model is over 77% faster (61.8 vs. 34.9 FPS) with only a 0.8-percentage-point drop in mAP@0.5, making it far more suitable for real-time industrial inspection.

3.2 Comparison with State-of-the-Art Detectors

The proposed model was compared against other mainstream object detection algorithms on the coupler casting defect dataset. The results are summarized in the following table.

| Model | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Precision (%) | Recall (%) | Params (M) | FPS |
|---|---|---|---|---|---|---|
| YOLOv5s | 89.7 | 74.9 | 91.5 | 86.1 | 7.2 | 33.6 |
| YOLOv7 | 92.1 | 75.5 | 92.6 | 89.3 | 36.5 | 27.6 |
| Faster R-CNN | 93.2 | 79.6 | 93.8 | 91.5 | 100.8 | 18.4 |
| YOLOv8n (Baseline) | 94.5 | 79.4 | 96.7 | 92.2 | 3.0 | 34.9 |
| Proposed Model | 93.7 | 77.3 | 94.3 | 91.5 | 1.2 | 61.8 |

Analysis: The two-stage detector Faster R-CNN achieves high accuracy (mAP@0.5 of 93.2%) but is the slowest (18.4 FPS) and has by far the most parameters, making it impractical for real-time use. Among the single-stage YOLO models, the original YOLOv8n offers the best accuracy but only moderate speed. The proposed model strikes an outstanding balance: it surpasses YOLOv5s and YOLOv7 in both accuracy and speed, stays within 0.8 percentage points of YOLOv8n in mAP@0.5, and even edges out Faster R-CNN at the 0.5 IoU threshold (93.7% vs. 93.2%), although Faster R-CNN remains slightly ahead under the stricter mAP@0.5:0.95 metric (79.6% vs. 77.3%). At the same time, it is 1.8x faster than YOLOv8n and 3.4x faster than Faster R-CNN. With the lowest parameter count of all the compared models, it is the most lightweight and efficient, demonstrating clear advantages for deploying casting defect detection in resource-aware industrial environments.

3.3 Practical Detection Performance under Varied Conditions

The robustness of the proposed model was tested under different lighting conditions, a common challenge in industrial settings. Under strong, direct lighting, the model maintained high detection confidence (e.g., 89%, 94%, 94% for different defect types), accurately locating and classifying casting defects despite potential glare and reflections. More importantly, under weak or uneven lighting conditions where defect contrast is very low, the model still performed reliably, achieving confidence scores of 90% and 84% for two different defect types. This robustness can be attributed to the SA-BRA module’s ability to dynamically focus on semantically relevant spatial regions beyond just local pixel intensities, allowing it to identify defect structures even when their appearance is subdued. This validates the model’s practical applicability for casting defect detection in non-ideal, real-world workshop environments.

4. Conclusion

This paper presented a highly efficient and accurate deep learning-based method for detecting surface defects in railway wagon coupler castings. To address the dual demands of real-time processing and high precision in industrial casting defect inspection, the YOLOv8n architecture was strategically enhanced. First, the backbone was replaced with the lightweight MobileNetV3 network, which drastically reduced computational complexity and increased inference speed by leveraging depthwise separable convolutions. Second, to recover the accuracy loss from lightweighting and to enhance feature discrimination for challenging defects, a novel Spatial-Aware Bi-level Routing Attention (SA-BRA) module was designed and integrated. This module combines the dynamic, content-aware region selection of Bi-level Routing Attention with the fine-grained spatial weighting of a Spatial Attention mechanism, forcing the network to concentrate its resources on the most salient areas for defect identification.

Extensive experiments on a dedicated dataset of coupler casting defect images demonstrated the effectiveness of the proposed approach. The final model achieved a mean Average Precision (mAP@0.5) of 93.7% while running at 61.8 frames per second. This represents a superior speed-accuracy trade-off compared to other state-of-the-art detectors like YOLOv5s, YOLOv7, Faster R-CNN, and the original YOLOv8n. The model also showed strong robustness under varying lighting conditions, a critical requirement for industrial deployment. The proposed system meets the essential criteria for automated casting defect inspection: real-time performance, high accuracy, and operational robustness. Future work will focus on expanding the dataset with more defect varieties and extreme lighting scenarios to further improve model generalization, and on optimizing the model for deployment on embedded hardware at the edge of the production line.
