Fusing Local and Global Features for Defect Detection in Casting Parts

The pervasive integration of lightweight, high-strength aluminum alloy casting parts into critical industries such as automotive and aerospace has elevated the importance of their structural integrity. The safe operation of mechanical components is directly contingent upon the quality of these castings. However, the manufacturing process is susceptible to a variety of surface and internal defects, which can be minuscule, diverse, and often indiscernible to the human eye. Traditional manual inspection methods are not only labor-intensive and inefficient but also prone to fatigue-related inaccuracies. Consequently, there is a compelling industrial demand for automated, highly accurate, and reliable non-destructive testing (NDT) systems. X-ray imaging stands as a prominent NDT technique, capable of revealing internal flaws. The core challenge lies in developing automated systems that can robustly interpret these complex X-ray images, distinguishing defective casting parts from sound ones amidst significant variations in part geometry, orientation, and imaging background.

The advent of deep learning, particularly Convolutional Neural Networks (CNNs), has revolutionized the field of automated visual inspection. Early approaches demonstrated the feasibility of using CNNs for defect detection in X-ray images. For instance, simplified network architectures like Xnet were developed, incorporating techniques like dropout to mitigate overfitting. These models showed promise but often struggled in complex industrial scenarios where defects are local, subtle, and obscured by intricate backgrounds. Subsequent research introduced more sophisticated architectures, such as Mask R-CNN-based systems, which performed simultaneous defect detection and segmentation. While powerful, these methods rely on costly pixel-level or bounding-box annotations. More recently, the focus has shifted towards weakly supervised learning, which only requires image-level labels (e.g., “defective” or “non-defective”), drastically reducing annotation costs. Models employing attention mechanisms within a weakly supervised framework have shown remarkable success in guiding the network to focus on discriminative local regions without explicit part annotations.

Despite these advances, the specific task of detecting defects in aluminum alloy casting parts from X-ray images presents unique and formidable challenges. The problem is essentially a fine-grained visual classification task with extreme intra-class variation and subtle inter-class differences. The “intra-class variation” refers to the vast differences in appearance among different types of casting parts (different shapes, sizes, backgrounds), all belonging to the general class of “casting parts.” The “inter-class difference” refers to the often minimal visual distinction between a defective and a non-defective sample of the same casting part. A model must ignore the large variations caused by part type and pose while remaining sensitive to tiny, localized flaws. A model that relies solely on global feature learning tends to be misled by the dominant background and part structure. Conversely, a model that focuses only on a local region risks missing the contextual information needed to locate the relevant defect area in the first place, especially when its position varies with the part’s orientation.

To address these intertwined challenges, we propose a novel weakly supervised deep learning model that fuses local and global features for defect detection in X-ray images of casting parts. Our core insight is that global contextual information is crucial for reliably locating potential defect regions, while localized, high-resolution examination of those regions is essential for making the final accurate classification. The two processes are interdependent and complementary. Our method therefore uses a dual-branch network architecture. The first branch processes the original, global image to learn coarse structural and contextual features. Guided by these features, we introduce a novel Detail Information Location and Extraction (DILE) module to automatically identify and crop the most discriminative local region from the image. This cropped region, rich in potential defect details, is then fed into the second branch. Both branches share the same feature extraction backbone, augmented with an efficient channel attention (ECA) mechanism to enhance feature representation. The final classification is derived from the local branch, having been primed by the global context. The entire system is trained end-to-end using only image-level labels. We validate our approach on a real-world industrial dataset of automotive casting part X-ray images, where it reaches 98.3% test accuracy and outperforms all compared baselines.

The overarching architecture of our proposed method is illustrated in the conceptual diagram. It consists of two primary pathways: the Global Feature Branch and the Local Feature Branch, which are integrated through a shared-weight backbone and a joint loss function. The input to the network is a single X-ray image of a casting part. Let this input image be denoted as $I_{global} \in \mathbb{R}^{H \times W \times 3}$ (resized and replicated to three channels).

The Global Feature Branch takes $I_{global}$ as input. It passes through a foundational CNN backbone (e.g., ResNet-50) to extract a hierarchy of feature maps. To enhance the network’s ability to model channel-wise dependencies efficiently, we integrate an Efficient Channel Attention (ECA) module after the backbone’s final convolutional blocks. The ECA module generates channel-wise attention weights without dimensionality reduction, preserving the direct correspondence between channels and their weights. Given an input feature map $X \in \mathbb{R}^{C \times H' \times W'}$, it first applies global average pooling (GAP) to produce a channel descriptor $z \in \mathbb{R}^{C}$:
$$z_c = \frac{1}{H' \times W'} \sum_{i=1}^{H'} \sum_{j=1}^{W'} X_c(i, j)$$
Then, it performs a fast 1D convolution with kernel size $k$ to capture local cross-channel interaction:
$$\omega = \sigma(\text{C1D}_k(z))$$
where $\sigma$ is the Sigmoid activation function, and $\text{C1D}_k$ denotes the 1D convolution. The attention weight vector $\omega$ is then used to rescale the original feature map $X$. The processed features are then passed through a fully connected (FC) layer to produce a preliminary class prediction. The loss for this branch, termed the Rough Loss, is the standard cross-entropy loss:
$$\mathcal{L}_{rough} = -\log(P_{global}(c))$$
where $P_{global}(c)$ is the predicted probability for the ground-truth class $c$ from the global branch.
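The ECA computation described above (pooling, 1D convolution across channels, Sigmoid, rescaling) can be illustrated with a minimal NumPy sketch. This is not the implementation used in our experiments: the function name `eca`, the zero padding at the ends of the channel axis, and the caller-supplied kernel (which would be learned in a real network) are assumptions made for illustration.

```python
import numpy as np

def eca(x, kernel):
    """Efficient channel attention on one feature map (illustrative sketch).

    x      : array of shape (C, H, W)
    kernel : 1D weights of odd length k; learned in a real network,
             supplied by the caller here (assumption)
    """
    kernel = np.asarray(kernel, dtype=float)
    k = kernel.size
    assert k % 2 == 1, "an odd kernel keeps the channel axis centred"

    z = x.mean(axis=(1, 2))              # global average pooling -> (C,)
    z_pad = np.pad(z, k // 2)            # zero padding (assumption) keeps length C
    # Sliding dot product over neighbouring channels: local cross-channel interaction.
    conv = np.array([z_pad[i:i + k] @ kernel for i in range(z.size)])
    omega = 1.0 / (1.0 + np.exp(-conv))  # Sigmoid -> one weight per channel in (0, 1)
    return x * omega[:, None, None], omega
```

Each channel receives exactly one weight computed from its $k$ nearest neighbours along the channel axis, so no channel dimensionality reduction is involved.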

The key innovation that bridges the global and local processing is the Detail Information Location and Extraction (DILE) module. The DILE module’s objective is to identify the image region containing the most discriminative fine-grained details, which likely correspond to defect areas in a defective casting part. It operates on the feature maps from the global branch’s backbone. We utilize activation maps from two different convolutional layers (e.g., `Conv_5b` and `Conv_5c` of ResNet-50) to generate localization cues. For a feature map $F \in \mathbb{R}^{C \times H'' \times W''}$, an activation map $A$ is computed by summing across the channel dimension:
$$A = \sum_{k=1}^{C} F_k$$
A threshold $a$ is calculated as the mean value of $A$:
$$a = \frac{1}{H'' \times W''} \sum_{i=1}^{H''} \sum_{j=1}^{W''} A(i, j)$$
A binary mask $M$ is generated by thresholding:
$$M(i,j) =
\begin{cases}
1, & \text{if } A(i,j) > a \\
0, & \text{otherwise}
\end{cases}$$
The masks from the two different layers are intersected to obtain a more precise and stable mask. This final mask is resized to the original input dimensions and used to extract the bounding box of the largest connected component. This bounding box is then applied to crop the original input image $I_{global}$, producing the local detail image $I_{local}$.
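The DILE steps above can be sketched as follows. This is a minimal illustration under stated assumptions rather than our exact implementation: the two feature maps are assumed to share the same spatial size, the mask is resized with nearest-neighbour sampling, connected components use 4-connectivity, and the full image is returned when the intersected mask is empty. All function names are hypothetical.

```python
from collections import deque
import numpy as np

def activation_mask(feat):
    """feat: (C, h, w). Sum over channels, then threshold at the mean."""
    act = feat.sum(axis=0)
    return act > act.mean()

def largest_component_bbox(mask):
    """(top, left, bottom, right), end-exclusive, of the largest
    4-connected component (assumed connectivity); None if mask is empty."""
    h, w = mask.shape
    seen = np.zeros((h, w), dtype=bool)
    best_size, best_box = 0, None
    for r in range(h):
        for c in range(w):
            if not mask[r, c] or seen[r, c]:
                continue
            queue = deque([(r, c)])
            seen[r, c] = True
            size, r0, r1, c0, c1 = 0, r, r, c, c
            while queue:  # breadth-first flood fill of one component
                y, x = queue.popleft()
                size += 1
                r0, r1 = min(r0, y), max(r1, y)
                c0, c1 = min(c0, x), max(c1, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                        seen[ny, nx] = True
                        queue.append((ny, nx))
            if size > best_size:
                best_size, best_box = size, (r0, c0, r1 + 1, c1 + 1)
    return best_box

def dile_crop(image, feat_a, feat_b):
    """image: (H, W, 3); feat_a, feat_b: (C, h, w) from two layers."""
    mask = activation_mask(feat_a) & activation_mask(feat_b)  # intersection
    H, W = image.shape[:2]
    h, w = mask.shape
    rows = np.arange(H) * h // H       # nearest-neighbour resize (assumption)
    cols = np.arange(W) * w // W
    mask_full = mask[rows][:, cols]
    box = largest_component_bbox(mask_full)
    if box is None:                    # fallback (assumption): keep the whole image
        return image
    top, left, bottom, right = box
    return image[top:bottom, left:right]
```

The returned crop would still need to be resized to the fixed input dimension of the local branch.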

The Local Feature Branch takes $I_{local}$ (resized to a fixed dimension) as its input. It is processed by the same shared-weight CNN backbone and ECA module as the global branch. This design ensures feature consistency and allows the local branch to focus intensely on the features within the proposed region, which has been contextually selected by the global branch. The features from this branch are fed into another shared FC layer to produce the final detailed class prediction. The loss for this branch is the Detail Loss:
$$\mathcal{L}_{detail} = -\log(P_{local}(c))$$
where $P_{local}(c)$ is the predicted probability from the local branch.
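The data flow through the two branches can be summarized in a framework-agnostic sketch. The components are passed in as plain callables, and every name here is hypothetical. Weight sharing is modelled simply by calling the same `backbone` twice; passing the same callable for both heads would model a fully shared classifier, which is one possible reading of the description above.

```python
def dual_branch_forward(image, backbone, head_global, head_local,
                        locate_and_crop, resize):
    """One forward pass through both branches (illustrative sketch).

    backbone        : shared feature extractor (backbone + ECA)
    head_global     : classifier producing the rough prediction
    head_local      : classifier producing the detailed prediction
    locate_and_crop : DILE step, maps (image, global features) -> crop
    resize          : brings the crop to the fixed network input size
    """
    feats_global = backbone(image)
    logits_global = head_global(feats_global)

    # The crop is chosen from the global features, so the global
    # branch always runs before the local branch.
    crop = locate_and_crop(image, feats_global)

    feats_local = backbone(resize(crop))   # same weights as above
    logits_local = head_local(feats_local)
    return logits_global, logits_local
```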

The total loss function for training the entire dual-branch network is a weighted combination of the two branch losses:
$$\mathcal{L}_{total} = \alpha \cdot \mathcal{L}_{rough} + \beta \cdot \mathcal{L}_{detail}$$
where $\alpha$ and $\beta$ are hyperparameters that balance the contribution of each branch’s learning signal. During inference, only the Local Feature Branch’s output is used for the final defect classification of the casting part. The global branch’s role is to provide the contextual guidance that the DILE module needs. Because the crop is computed from the global branch’s feature maps, the global branch is still executed in the forward pass at inference time, even though its own prediction is not used for the final decision.
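As a small worked example of the combined loss, the sketch below computes the cross-entropy of each branch from raw logits and forms the weighted sum. The function names and the use of logits as inputs are assumptions for illustration.

```python
import math

def cross_entropy(logits, target):
    """-log softmax(logits)[target], computed with the log-sum-exp trick."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]

def total_loss(global_logits, local_logits, target, alpha=1.0, beta=1.0):
    """alpha * L_rough + beta * L_detail for a single image."""
    l_rough = cross_entropy(global_logits, target)
    l_detail = cross_entropy(local_logits, target)
    return alpha * l_rough + beta * l_detail
```

Setting `alpha=1, beta=0` or `alpha=0, beta=1` reduces the objective to a single branch loss.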

We evaluate our proposed method on a real-world industrial dataset comprising X-ray images of automotive aluminum alloy casting parts. The dataset was constructed to ensure class balance, containing 4,496 non-defective and 4,496 defective images, totaling 8,992 images. The dataset was split into 6,992 images for training, 1,000 for validation, and 1,000 for testing. All images have a resolution of 1000×1000 pixels and were resized to 448×448 for network input. To improve model robustness, extensive data augmentation was applied, including random flipping, contrast/saturation/brightness adjustment, noise injection, and cropping.

Our model was implemented using PyTorch. We used a ResNet-50 model, pre-trained on ImageNet, as our shared backbone. The ECA module was integrated after the last convolutional block of both branches. The model was trained using Stochastic Gradient Descent (SGD) with a momentum of 0.9, weight decay of 0.0001, and a batch size of 8. The initial learning rate was set to 0.001 and was reduced by a factor of 10 at epochs 40 and 60. The hyperparameters $\alpha$ and $\beta$ in the loss function were both set to 1. Training proceeded for 100 epochs.
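The step schedule for the learning rate can be written as a short function. This is a sketch that assumes epochs are counted from zero and that each drop takes effect at the start of the listed epoch; the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=0.001, milestones=(40, 60), gamma=0.1):
    """Learning rate in a given epoch under step decay.

    Assumes zero-based epochs, with each drop applied from the
    milestone epoch onward.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops
```

Under these assumptions the rate is 0.001 for epochs 0 to 39, 0.0001 for epochs 40 to 59, and 0.00001 from epoch 60 to the end of training. In PyTorch, the same behaviour is typically obtained with `torch.optim.lr_scheduler.MultiStepLR` using `milestones=[40, 60]` and `gamma=0.1`.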

We compared our model against several mainstream CNN architectures and recent state-of-the-art methods for defect detection and fine-grained classification. All models were trained and evaluated under the same conditions on our dataset. Performance was measured using standard metrics: Accuracy, Precision, Recall, and F1-Score. The results are summarized in the table below.

Method	Accuracy (%)	Precision (%)	Recall (%)	F1-Score (%)
VGG-16	90.0	88.5	92.0	90.2
VGG-19	85.0	85.7	84.0	84.8
ResNet-18	88.5	87.4	90.0	88.7
ResNet-34	83.2	82.7	84.0	83.3
Xception	90.3	88.5	92.6	90.5
WS-DAN	92.0	93.8	90.0	91.9
MC-DAN	95.5	95.1	96.0	95.5
Our Approach	98.3	99.2	96.7	97.9

Our method achieves a test accuracy of 98.3% and the highest value on every metric. It exceeds the best conventional CNN (Xception, 90.3%) by 8.0 percentage points in accuracy. It also surpasses the recent methods WS-DAN (92.0%) and MC-DAN (95.5%), which also employ attention and data augmentation strategies, by 6.3 and 2.8 points respectively. This indicates that our strategy of explicitly fusing globally-guided local features is particularly effective for the challenging task of defect detection in casting parts.
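For reference, the four reported metrics follow from the confusion-matrix counts as sketched below. The sketch assumes that "defective" is treated as the positive class, which is not stated explicitly above, and that no denominator is zero.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)

def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 as fractions in [0, 1].

    tp/fp/fn/tn are counts with 'defective' as the positive class
    (assumption).
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall, f1_score(precision, recall)
```

As a consistency check, `f1_score(99.2, 96.7)` evaluates to approximately 97.9, which agrees with the F1-Score reported for our approach.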

To substantiate the contribution of each component in our architecture, we conducted ablation studies. First, we investigated the impact of the ECA attention module and the design of the loss function.

Ablation on the Attention Mechanism: We trained three variants of our model: without any channel attention, with the Squeeze-and-Excitation (SE) attention module, and with our chosen ECA module. The results are shown below.

Model Variant	Accuracy (%)	Precision (%)	Recall (%)	F1-Score (%)
Without Attention	97.6	98.3	95.8	97.0
With SE Attention	97.8	98.7	95.8	97.2
With ECA Attention	98.3	99.2	96.7	97.9

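The ECA module yields the best performance. SE provides only a minor improvement over the no-attention baseline (+0.2 points in accuracy and F1-score, with recall unchanged at 95.8%), whereas ECA improves every metric by 0.7 to 0.9 points. We attribute this to ECA’s more efficient and direct modeling of channel interactions, which supports its suitability for enhancing feature representation in our task.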

Ablation on the Loss Function and Branch Contribution: The hyperparameters $\alpha$ and $\beta$ control the influence of the global and local branches during training. We analyzed three critical configurations.

Parameters ($\alpha$, $\beta$)	Description	Accuracy (%)
(1, 0)	Only global branch is active (trained with $\mathcal{L}_{rough}$ only).	96.9
(0, 1)	Only local branch is active (trained with $\mathcal{L}_{detail}$ only, using random crops).	96.9
(1, 1)	Both branches active, our full model.	98.3

The two single-branch configurations reach the same accuracy (96.9%), 1.4 percentage points below the full model. With only the global branch ($\alpha=1, \beta=0$), the model learns general features but is likely distracted by the dominant background and part structure, failing to consistently focus on subtle defect cues. Using only the local branch ($\alpha=0, \beta=1$) with randomly cropped regions performs no better, as the network frequently receives irrelevant local patches that do not contain discriminative information. In contrast, our full model ($\alpha=1, \beta=1$) achieves the highest accuracy (98.3%). This supports our core hypothesis: the global branch provides essential contextual guidance to the DILE module, enabling it to consistently propose relevant local regions. The local branch can then specialize in analyzing these high-value regions. The two branches are mutually reinforcing, and their joint optimization is crucial for superior performance in detecting defects in casting parts.

To gain intuitive insight into our model’s operation, we visualize the regions located by the DILE module and the final attention maps of the local branch. The DILE module successfully identifies and crops regions containing rich, discriminative details, such as specific junctions, edges, or textured areas that are critical for distinguishing defect states. These regions vary significantly across different casting part types and orientations, demonstrating the module’s adaptability. Further visualization of the local branch’s attention (e.g., using Grad-CAM) shows that the network focuses intensely on even finer structures within the proposed region, often pinpointing the exact flaw, such as a tiny crack or porosity. This two-stage focus—from global context to local region proposal, and then to fine-grained analysis within that region—is the visual manifestation of our method’s effectiveness.

In this work, we have presented a novel dual-branch deep learning framework for the challenging task of defect detection in X-ray images of aluminum alloy casting parts. The method operates under a weakly supervised paradigm, requiring only image-level labels. Its key innovation is the synergistic fusion of global and local feature learning. The global branch comprehends the overall structure and context of the casting part, which is leveraged by our proposed DILE module to intelligently locate the most discriminative local region. This region is then meticulously analyzed by the local branch, whose feature representation is enhanced by an efficient channel attention mechanism. The two branches are co-trained with a composite loss function, enabling them to complement each other.

Extensive experiments on a real-world industrial dataset demonstrate that our approach significantly outperforms standard CNN architectures and recent advanced methods, achieving an accuracy of 98.3%. Ablation studies confirm the necessity of both global and local pathways and the efficacy of the ECA module. The proposed method offers a robust, accurate, and practical solution for automated quality inspection of casting parts, with the potential to enhance safety and efficiency in manufacturing industries. A current limitation is the assumption of a single primary defect region; future work will explore extensions to handle multiple defect locations and the identification of specific defect types within the same framework.