Enhanced Lightweight Weld Defect Detection in Large Casting Parts Using Modified YOLOv7-tiny

The manufacturing sector, particularly in heavy industries, relies heavily on the integrity of large-scale metal structures. Among these, casting parts are fundamental components. Ensuring the quality of welds in these large casting parts is paramount, as defects like weld tumors, slag inclusions, cracks, porosity, underfill, and arc craters can severely compromise structural integrity, leading to potential equipment failure and significant safety hazards. Traditional manual inspection methods are not only labor-intensive and time-consuming but also susceptible to human error and inconsistency, especially in harsh industrial environments. Therefore, the development of automated, robust, and real-time defect detection systems is an urgent necessity for modern smart manufacturing.

Deep learning, especially Convolutional Neural Network (CNN)-based object detection algorithms, has revolutionized automated visual inspection. Single-stage detectors like the YOLO (You Only Look Once) series have gained prominence due to their excellent balance between speed and accuracy, making them suitable for real-time applications. However, directly applying standard models to weld defect detection in large casting parts presents unique challenges. The defects themselves are often small, irregular in shape, and exhibit low contrast against the textured background of the casting surface. Furthermore, the detection system must often operate with limited computational resources on the factory floor, necessitating a model that is both highly accurate and computationally lightweight.

While versions like YOLOv7 offer high accuracy, their computational cost can be prohibitive for edge deployment. The tiny variant, YOLOv7-tiny, is designed for speed and lower parameter count but sometimes at the expense of detection precision, particularly for challenging, irregularly shaped defects common in casting part welds. This work addresses these specific gaps by proposing a series of targeted improvements to the YOLOv7-tiny architecture, creating a model optimized for the precise, fast, and lightweight detection of weld defects in large casting parts.

Proposed Improved YOLOv7-tiny Architecture

The core objective is to enhance the feature representation and computational efficiency of YOLOv7-tiny specifically for the task of weld flaw identification in casting parts. The original YOLOv7-tiny network consists of a backbone for initial feature extraction and a head for multi-scale feature fusion and prediction. Our modifications are strategically integrated into both segments to improve performance on irregular defects, reduce model size, and boost sensitivity to small targets. The overall architectural philosophy is summarized in the following table:

Network Component	Original Module	Proposed Improvement	Primary Benefit
Backbone Feature Extraction	Standard 3×3 Convolution in later stages	Deformable Convolution (DCNv2)	Adapts to irregular defect shapes in casting parts
Efficient Layer Aggregation (ELAN) Blocks	ELAN with standard convolutions	ELAN-PCS (Partial Convolution + SimAM)	Reduces parameters/computation; enhances feature focus
Head & Feature Fusion Paths	Standard Convolutional Paths	Integration of SE (Squeeze-and-Excitation) Attention	Boosts sensitivity to small/medium defect features

1. Deformable Convolution for Irregular Defect Geometry

Weld defects in casting parts, such as meandering cracks or scattered porosity, do not conform to regular, rectangular patterns. Standard convolution operations apply a fixed geometric kernel (e.g., a 3×3 grid) uniformly across the input feature map, which limits their ability to model geometric transformations. To overcome this, we replace a key standard 3×3 convolution in the deeper layers of the backbone with Deformable Convolution v2 (DCNv2).

DCNv2 augments the standard convolution by adding learnable 2D offset fields and modulation scalars. For each location $p$ on the output feature map $y$, the convolution samples input features not from a fixed grid but from adaptive, learned positions. This allows the network to dynamically adjust its receptive field based on the content, effectively “focusing” on the non-rigid structure of defects. The operation is defined as:

$$y(p) = \sum_{k=1}^{K} w_k \cdot x(p + p_k + \Delta p_k) \cdot \Delta m_k$$

where $ p_k $ enumerates the locations in the regular kernel (e.g., 9 points for a 3×3 kernel), $ \Delta p_k $ is the learned offset for the k-th location, and $ \Delta m_k $ is a learned modulation scalar in the range [0,1] that adjusts the importance of each sampled point. $ w_k $ represents the convolutional weight. By integrating DCNv2, the model gains a more flexible spatial modeling capacity, crucial for accurately capturing the nuanced shapes of defects in casting part welds.
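
To make the sampling rule above concrete, the following is a minimal pure-Python sketch that evaluates $y(p)$ at a single output location for a 3×3 kernel. The function names, the use of bilinear interpolation for fractional positions, and the zero padding outside the feature map are assumptions made for illustration; in a real network the offsets and modulation scalars would be predicted by an auxiliary convolution rather than passed in by hand.

```python
import math

def bilinear_sample(feature, y, x):
    """Bilinearly interpolate a 2D list at fractional (y, x); zero outside the map."""
    h, w = len(feature), len(feature[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                # Each neighbour is weighted by its closeness to (y, x).
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * feature[yy][xx]
    return value

def modulated_deform_conv_at(feature, p, weights, offsets, masks):
    """Evaluate y(p) = sum_k w_k * x(p + p_k + dp_k) * dm_k for a 3x3 kernel."""
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]  # regular p_k
    total = 0.0
    for k, (dy, dx) in enumerate(grid):
        off_y, off_x = offsets[k]  # learned offset dp_k (supplied by hand here)
        sample = bilinear_sample(feature, p[0] + dy + off_y, p[1] + dx + off_x)
        total += weights[k] * sample * masks[k]  # masks[k] is dm_k in [0, 1]
    return total

# Toy 5x5 feature map whose value grows by 1 per column and by 5 per row.
feature = [[float(r * 5 + c) for c in range(5)] for r in range(5)]
weights = [1.0 / 9.0] * 9
masks = [1.0] * 9
no_shift = modulated_deform_conv_at(feature, (2, 2), weights, [(0.0, 0.0)] * 9, masks)
half_shift = modulated_deform_conv_at(feature, (2, 2), weights, [(0.0, 0.5)] * 9, masks)
```

With zero offsets and unit modulation the result reduces to an ordinary 3×3 average around the centre; shifting every sampling point half a pixel to the right moves the average by half a column step, which is the behaviour the learned offsets exploit.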

2. ELAN-PCS: Lightweight and Attentive Feature Aggregation

The ELAN structure in YOLOv7 is designed to control gradient paths for stable training in deep networks. However, its use of multiple standard 3×3 convolutions contributes significantly to parameter count and computational load (FLOPs). To achieve a more lightweight model suitable for deployment in foundry environments, we redesign this module into the ELAN-PCS block.

The first component is **PConv (Partial Convolution)**. PConv is based on the observation that feature maps are often highly correlated across channels. It applies standard convolution only to a subset of input channels (e.g., the first or last 1/4) for spatial feature extraction, while leaving the remaining channels untouched. This simple strategy dramatically reduces redundant computation. The FLOPs for a PConv layer are approximately:

$$\mathrm{FLOPs}_{\mathrm{PConv}} \approx \frac{1}{4} \times h \times w \times k^2 \times c_{in} \times c_{out} + h \times w \times (c_{in} - c_{p})$$

where $h, w$ are spatial dimensions, $k$ is kernel size, $c_{in}, c_{out}$ are input/output channels, and $c_p$ is the number of channels processed by convolution. This is substantially lower than standard convolution’s $h \times w \times k^2 \times c_{in} \times c_{out}$.
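
As a quick numerical illustration, the sketch below evaluates the approximation exactly as it is written above and compares it with the standard-convolution count. The function names and the layer size are hypothetical, and the fixed factor of 1/4 is generalized to the ratio $c_p / c_{in}$, which is an assumption that matches the 1/4 example when a quarter of the channels are convolved.

```python
def standard_conv_flops(h, w, k, c_in, c_out):
    """FLOPs of a standard k x k convolution, as stated in the text."""
    return h * w * k * k * c_in * c_out

def pconv_flops_approx(h, w, k, c_in, c_out, c_p):
    """Approximation from the text, with 1/4 generalized to c_p / c_in (assumption)."""
    ratio = c_p / c_in
    conv_part = ratio * standard_conv_flops(h, w, k, c_in, c_out)
    passthrough_part = h * w * (c_in - c_p)  # channels left untouched
    return conv_part + passthrough_part

# Hypothetical layer: 40x40 feature map, 3x3 kernel, 128 channels, 32 convolved.
standard = standard_conv_flops(40, 40, 3, 128, 128)
partial = pconv_flops_approx(40, 40, 3, 128, 128, 32)
saving_ratio = partial / standard
```

For this hypothetical layer the approximation comes out at roughly a quarter of the standard count, since the pass-through term is small compared with the convolution term.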

The second component is the **SimAM (Simple, Parameter-free Attention Module)**. To compensate for any potential feature representation loss from light-weighting and to enhance the model’s focus on critical defect regions, we integrate SimAM after PConv. Unlike channel or spatial attention that generates 1D or 2D weights, SimAM infers 3D attention weights for the entire feature map without adding parameters. It defines an energy function for each neuron $t$:

$$e_t = \frac{4(\sigma^2 + \lambda)}{(t - \mu)^2 + 2\sigma^2 + 2\lambda}$$

Here, $\mu$ and $\sigma^2$ are the mean and variance of all neurons within a single channel of the input feature. $\lambda$ is a regularization constant. A lower energy $e_t$ indicates the neuron $t$ is more distinctive from its neighbors and thus more important. The final output feature $\widetilde{X}$ is obtained by amplifying these important neurons:

$$\widetilde{X} = \operatorname{sigmoid}\left(\frac{1}{E}\right) \odot X$$

where $E$ is the set of all $e_t$, and $\odot$ denotes element-wise multiplication. The ELAN-PCS block thus combines efficient computation with powerful, parameter-free attention, making it ideal for extracting salient features from images of casting parts.
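
The two SimAM equations above can be sketched for a single channel in a few lines of pure Python. This is an illustrative sketch rather than the implementation used in the experiments: the function name is hypothetical, the variance is taken as the population variance over the channel, and the default $\lambda = 10^{-4}$ is an assumed value.

```python
import math

def simam_channel(channel, lam=1e-4):
    """Apply SimAM weighting to one channel given as a 2D list of floats."""
    values = [v for row in channel for v in row]
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values) / n  # population variance (assumption)
    out = []
    for row in channel:
        new_row = []
        for t in row:
            # 1 / e_t, i.e. the energy formula from the text inverted
            inv_energy = ((t - mu) ** 2 + 2 * var + 2 * lam) / (4 * (var + lam))
            weight = 1.0 / (1.0 + math.exp(-inv_energy))  # sigmoid(1 / e_t)
            new_row.append(weight * t)
        out.append(new_row)
    return out

# One activation (5.0) stands out from an otherwise flat background of 1.0.
channel = [[1.0, 1.0], [1.0, 5.0]]
refined = simam_channel(channel)
background_weight = refined[0][0] / channel[0][0]
outlier_weight = refined[1][1] / channel[1][1]
```

The outlier receives a larger weight than the background activations because its distance from the channel mean lowers its energy, and no learnable parameters are involved.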

3. SE Attention for Enhanced Small Defect Sensitivity

Small and medium-sized defects like fine cracks or tiny pores are particularly challenging to detect in the vast surface area of a casting part. While the backbone extracts features, not all feature channels are equally important for identifying these subtle flaws. We incorporate **Squeeze-and-Excitation (SE)** attention blocks into specific feature fusion paths within the head network.

The SE block performs channel-wise recalibration. It first squeezes global spatial information into a channel descriptor $z \in \mathbb{R}^C$ via Global Average Pooling (GAP): $z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_c(i,j)$. Then, it excites this descriptor using a simple gating mechanism with a sigmoid activation to model channel-wise dependencies and produce adaptive weights $s$:

$$s = \sigma(W_2 \delta(W_1 z))$$

where $W_1$ and $W_2$ are weights of two fully-connected layers, $\delta$ is the ReLU activation, and $\sigma$ is the sigmoid function. The final output is the channel-wise multiplication of the original feature map $X$ and the weight vector $s$: $\widetilde{X}_c = s_c \cdot X_c$. This mechanism allows the model to emphasize informative feature channels related to critical defects while suppressing less useful ones, significantly improving detection recall for smaller anomalies in casting part welds.
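
The squeeze, excitation, and scaling steps described above can be traced with a small pure-Python sketch. The function name and the toy weights are hypothetical, and biases are omitted to match the formula $s = \sigma(W_2 \delta(W_1 z))$ as written.

```python
import math

def se_block(x, w1, w2):
    """x: list of C channels (each an H x W 2D list); w1: (C/r) x C; w2: C x (C/r)."""
    # Squeeze: global average pooling gives one descriptor z_c per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]
    # Excitation: fully connected -> ReLU -> fully connected -> sigmoid.
    hidden = [max(0.0, sum(w * zc for w, zc in zip(row, z))) for row in w1]
    s = [1.0 / (1.0 + math.exp(-sum(w * hj for w, hj in zip(row, hidden))))
         for row in w2]
    # Scale: multiply every value in channel c by its weight s_c.
    out = [[[s_c * v for v in row] for row in ch] for s_c, ch in zip(s, x)]
    return s, out

# Two 2x2 channels with a reduction to a single hidden unit (toy weights).
x = [[[1.0, 1.0], [1.0, 1.0]],
     [[2.0, 2.0], [2.0, 2.0]]]
w1 = [[1.0, 1.0]]        # 1 x 2
w2 = [[1.0], [-1.0]]     # 2 x 1
s, recalibrated = se_block(x, w1, w2)
```

With these toy weights the first channel is kept almost unchanged while the second is strongly suppressed, which is the channel-wise recalibration the SE block is meant to learn.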

Experimental Setup and Dataset Curation for Casting Part Evaluation

To validate the proposed improvements, a dedicated dataset of weld defects on casting parts was assembled and a rigorous experimental protocol was established.

Dataset Construction and Augmentation

The initial dataset comprised 2,038 images of casting part welds, annotated with six defect classes common in industrial settings: Weld Tumor, Slag Inclusion, Crack, Porosity, Underfill, and Arc Crater. To combat overfitting and improve generalization across the varied appearances of casting parts, a data augmentation pipeline was applied, expanding the dataset sixfold to 12,228 images. The augmentations are listed below:

Augmentation Technique	Parameters / Range	Purpose
Random Rotation	Angle: [-7°, +7°]	Invariance to camera/view angle
Horizontal Translation	Shift: [0%, 10%] of width	Robustness to object placement
Additive Gaussian Noise	Mean=0, Variance=0.01	Simulates sensor noise
Random Brightness Adjustment	Delta: [-60, +80] intensity	Robustness to lighting variation
Random Cropping (Object-aware)	Based on target bounding box	Forces focus on defect context
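
Two of the photometric augmentations in the table can be sketched in pure Python as follows. This is an illustrative sketch under stated assumptions, not the pipeline used to build the dataset: images are taken to be 8-bit grayscale 2D lists, the noise variance of 0.01 is interpreted on a [0, 1] intensity scale, and the function names are hypothetical.

```python
import random

def adjust_brightness(image, delta):
    """Add an intensity delta to every pixel and clip to the 8-bit range."""
    return [[min(255, max(0, px + delta)) for px in row] for row in image]

def add_gaussian_noise(image, variance=0.01, rng=random):
    """Add zero-mean Gaussian noise; variance is on a [0, 1] scale (assumption)."""
    sigma = variance ** 0.5
    noisy = []
    for row in image:
        noisy.append([
            min(255, max(0, round(255 * (px / 255 + rng.gauss(0.0, sigma)))))
            for px in row
        ])
    return noisy

rng = random.Random(0)                   # fixed seed for reproducibility
image = [[100, 150], [200, 250]]
delta = rng.randint(-60, 80)             # brightness range from the table
brighter = adjust_brightness(image, delta)
noisy = add_gaussian_noise(image, rng=rng)
```

Geometric augmentations such as rotation, translation, and object-aware cropping also require the bounding-box annotations to be transformed, which is why they are left out of this short sketch.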

The final dataset was split into training (70%), validation (10%), and testing (20%) sets to ensure unbiased evaluation.

Implementation Details and Evaluation Metrics

All models were trained and evaluated under a consistent environment to ensure fair comparison. The configuration is detailed in the following table:

Component	Configuration / Hyperparameter
Hardware (GPU)	NVIDIA Tesla P40 (24GB VRAM)
Framework	PyTorch with CUDA 10.1
Input Image Size	640 × 640 pixels
Training Epochs	230
Batch Size	32
Optimizer	SGD (Momentum=0.937, Weight Decay=0.0005)
Initial Learning Rate	0.01 (Cosine Annealing scheduler)
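
For readers unfamiliar with the scheduler named in the table, the sketch below shows the usual cosine annealing rule with the listed initial learning rate and epoch count. The minimum learning rate `lr_min` and the per-epoch update granularity are not stated in the configuration, so both are assumptions here, as is the function name.

```python
import math

def cosine_annealing_lr(epoch, total_epochs=230, lr_init=0.01, lr_min=0.0):
    """Cosine decay from lr_init at epoch 0 to lr_min at total_epochs.

    lr_min is a hypothetical value; the text only gives the initial rate.
    """
    cosine = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_init - lr_min) * (1 + cosine)

schedule = [cosine_annealing_lr(e) for e in range(231)]
```

The rate starts at 0.01, passes through half that value at the midpoint of training, and decays smoothly to the assumed minimum at the final epoch.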

The performance was assessed using standard object detection metrics, with a focus on precision, recall, and different mAP thresholds suitable for casting part quality control:

Precision (P): $ P = \frac{TP}{TP + FP} $ – The accuracy of positive predictions.
Recall (R): $ R = \frac{TP}{TP + FN} $ – The ability to find all positive instances.
Average Precision (AP): The area under the Precision-Recall curve for a single class.
mean Average Precision (mAP): The mean of AP over all classes.
- mAP@0.5: mAP computed at a single Intersection-over-Union (IoU) threshold of 0.5.
- mAP@0.5:0.95: mAP averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05. This is a stricter metric because it rewards tight localization.
Model Parameters (MP): Total number of trainable parameters (in millions).
Frames Per Second (FPS): Inference speed on the test hardware, indicating real-time capability.
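
The definitions above can be illustrated with a short pure-Python sketch. The function names are hypothetical, boxes are assumed to be in `(x1, y1, x2, y2)` format, and the AP routine uses all-point interpolation of the precision-recall curve, which is one common convention and not necessarily the one used by the evaluation code in this study.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def precision_recall(tp, fp, fn):
    """P = TP / (TP + FP), R = TP / (TP + FN); 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def average_precision(recalls, precisions):
    """Area under the PR curve (all-point interpolation); recalls must be increasing."""
    mrec = [0.0] + list(recalls) + [1.0]
    mpre = [0.0] + list(precisions) + [0.0]
    for i in range(len(mpre) - 2, -1, -1):   # make precision non-increasing
        mpre[i] = max(mpre[i], mpre[i + 1])
    return sum((mrec[i + 1] - mrec[i]) * mpre[i + 1] for i in range(len(mrec) - 1))

# The ten IoU thresholds used by mAP@0.5:0.95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
p, r = precision_recall(tp=8, fp=2, fn=2)
ap = average_precision([0.5, 1.0], [1.0, 0.5])
```

A detection counts as a true positive only if its IoU with a ground-truth box of the same class meets the threshold, so the same set of predictions yields a lower AP as the threshold moves from 0.5 toward 0.95.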

Results and Comprehensive Analysis

Comparative Analysis with State-of-the-Art Models

The proposed model was compared against several popular YOLO variants on the casting part weld defect test set. The results, presented in the table below, demonstrate the effectiveness of our approach.

Model	mAP@0.5 (%)	mAP@0.5:0.95 (%)	MP (M)	FPS
YOLOv3	88.2	55.6	59.4	25.5
YOLOv5s	93.3	64.9	7.0	65.8
YOLOv7	96.4	73.3	37.2	57.6
Original YOLOv7-tiny	95.0	70.6	6.0	144.9
Proposed Improved YOLOv7-tiny	96.8	77.4	4.1	178.6

Key Findings:

Superior Accuracy-Speed Trade-off: Our model achieves the highest mAP@0.5 (96.8%) among all compared models, surpassing even the larger YOLOv7. More importantly, it attains the highest mAP@0.5:0.95 (77.4%), indicating much better localization accuracy for defects in casting parts under strict IoU criteria.
Significant Lightweighting: With only 4.1 million parameters, our model has 1.9M fewer parameters than the original YOLOv7-tiny (a 31.7% reduction), making it extremely compact.
Unmatched Inference Speed: The inference speed of 178.6 FPS is the highest, showing a 23.3% speedup over the already-fast original YOLOv7-tiny. This makes it highly suitable for real-time inspection lines.
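
The relative figures quoted in these findings follow directly from the comparison table. A short check, using a hypothetical helper name, is:

```python
def relative_change(new, old):
    """Percentage change of `new` relative to `old`."""
    return (new - old) / old * 100.0

# Values taken from the comparison table (YOLOv7-tiny vs. the proposed model).
param_reduction = -relative_change(4.1, 6.0)    # parameters, in millions
speedup = relative_change(178.6, 144.9)         # frames per second
map50_gain = 96.8 - 95.0                        # percentage points
map5095_gain = 77.4 - 70.6                      # percentage points
```

Note that the parameter and speed figures are relative percentages, whereas the mAP gains are differences in percentage points.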

Ablation Study: Contribution of Each Component

An ablation study was conducted to isolate the contribution of each proposed modification. Starting from the baseline YOLOv7-tiny, components were added sequentially. The results are systematically shown below:

Experiment Configuration	Precision (P) (%)	Recall (R) (%)	mAP@0.5 (%)	mAP@0.5:0.95 (%)	MP (M)
Baseline: YOLOv7-tiny	90.1	78.2	95.0	70.6	6.0
+ DCNv2 only	91.9	79.6	95.5	75.1	6.1
+ PConv only	91.4	78.3	95.6	72.8	3.9
+ PConv + SimAM (ELAN-PCS)	91.2	80.3	95.6	73.4	3.9
+ ELAN-PCS + DCNv2	91.9	79.6	95.7	73.7	4.1
Full Model: + ELAN-PCS + DCNv2 + SE	92.1	83.8	96.8	77.4	4.1

Ablation Analysis:

DCNv2: Improves both precision and recall, and gives a substantial 4.5-point boost in strict mAP@0.5:0.95, validating its role in modeling irregular geometries in casting part defects.
PConv/ELAN-PCS: Cuts parameters from 6.0M to 3.9M with no loss of accuracy; mAP@0.5 rises slightly from 95.0% to 95.6%. Adding SimAM raises recall from 78.3% to 80.3% at the same parameter count.
SE Attention (Final Step): The integration of SE blocks provides the most significant jump in Recall (from 79.6% to 83.8%) and the final mAP gains. This confirms that channel-wise feature recalibration is critical for detecting elusive small and medium defects in casting part welds, reducing missed detections (false negatives).

Qualitative Detection Results on Casting Parts

Visual inspection of detection results on challenging test images further validates the model’s robustness. The improved model consistently outperforms the baseline YOLOv7-tiny in several key scenarios common when inspecting large casting parts:

Complex Backgrounds: On images with strong textual markings or intricate casting textures, the baseline model frequently produced false positives or misclassified defects (e.g., labeling an arc crater as porosity). Our model, with its enhanced feature discrimination from DCNv2 and SE attention, correctly identifies and classifies the defects while ignoring background clutter.
Low-Light/Shadow Conditions: Welds on large casting parts often have uneven lighting. The improved model shows higher confidence scores and better bounding box localization for defects like weld tumors and underfill in shadowed regions, thanks to the more robust feature representation.
Multi-Defect & Small Object Scenes: In images containing multiple defects of varying sizes, the proposed model demonstrates superior detection capability, particularly for smaller slag inclusions and fine cracks, which the baseline model occasionally missed. The significant boost in recall is visually evident here.

Conclusion

This work presents an enhanced, lightweight YOLOv7-tiny model tailored for the automated visual inspection of weld defects in large casting parts. The core improvements address the unique challenges of this industrial task: irregular defect shapes, the need for low-latency deployment, and the difficulty of detecting small anomalies. By integrating Deformable Convolution (DCNv2), we gave the model adaptive spatial modeling capacity to capture the geometry of flaws like cracks and porosity. The ELAN-PCS block, combining efficient Partial Convolution (PConv) with the parameter-free SimAM attention mechanism, cut the parameter count from 6.0M to 3.9M, leaving the full model at 4.1M (a net reduction of 1.9M, or 31.7%) while preserving and even slightly improving accuracy. Finally, the inclusion of SE channel attention modules in the head raised recall from 79.6% to 83.8% in the ablation study, which is essential for ensuring that no serious defect on a casting part goes undetected.

Experiments on a curated dataset of casting part welds show that the proposed model achieves the best results among the compared detectors. It surpasses the original YOLOv7-tiny by 1.8 percentage points in mAP@0.5 and by 6.8 percentage points in the stricter mAP@0.5:0.95, while being faster (178.6 FPS versus 144.9 FPS) and smaller (4.1M versus 6.0M parameters). This balance of high precision, accurate localization under strict IoU thresholds, real-time speed, and compact size makes the proposed improved YOLOv7-tiny a strong candidate for deployment in real-world industrial settings for the quality assurance of large and critical casting parts.