Enhancing Defect Prediction in Casting Parts with SMOTE Data Preprocessing

In the realm of manufacturing, the production of high-quality casting parts is paramount for ensuring structural integrity and performance in critical applications. As a researcher focused on digital management systems for casting processes, I have observed that sand casting, while versatile, often leads to complex defects in casting parts due to numerous process parameters and environmental variables. These defects, such as cold shuts, blowholes, sand inclusions, and shrinkage, can compromise the reliability of casting parts, especially in safety-critical components like steering bridges for engineering vehicles. Data-driven artificial intelligence methods offer a promising avenue for defect prediction, leveraging historical production data to guide process adjustments and reduce defect rates. However, a significant hurdle arises from the inherent imbalance in real-world datasets: while vast amounts of data are collected for non-defective casting parts, instances of defective casting parts are scarce, leading to skewed distributions that hinder model accuracy. This imbalance is a pervasive issue in predictive modeling for casting parts, as algorithms trained on such data tend to favor the majority class, resulting in poor detection of rare but critical defects. In this article, I explore the application of the Synthetic Minority Oversampling Technique (SMOTE) as a data preprocessing method to address this challenge, thereby improving defect prediction models for complex casting parts in sand casting.

The foundation of any data-driven approach lies in the quality and balance of the dataset. In my work, data were sourced from the production records of steering bridge casting parts, which are quintessential complex casting parts made from ductile iron (grade QT450-10). These casting parts serve as key transmission components in engineering machinery, subjected to substantial loads and moments, thus demanding excellent mechanical properties. The raw dataset comprised process parameters closely tied to the quality of casting parts, including chemical composition (e.g., C, Si, Mn content), pouring parameters (e.g., pouring temperature, time), and sand treatment parameters (e.g., compactability, clay content). After initial preprocessing to remove irrelevant records, the dataset exhibited a severe imbalance: out of 6,828 instances, only 1,180 represented defective casting parts, with the sub-categories of cold shuts (274 instances), blowholes (399 instances), sand inclusions (359 instances), and shrinkage (148 instances) being vastly outnumbered by non-defective casting parts (5,848 instances). This disparity is summarized in Table 1, highlighting the data imbalance issue that plagues defect prediction for casting parts.

Table 1: Data Distribution Before Preprocessing for Casting Parts
Class | Number of Instances | Percentage (%)
Non-defective Casting Parts | 5,848 | 85.7
Cold Shut Defects | 274 | 4.0
Blowhole Defects | 399 | 5.8
Sand Inclusion Defects | 359 | 5.3
Shrinkage Defects | 148 | 2.2
Total | 6,828 | 100.0

Such imbalance renders conventional machine learning models ineffective. For instance, a model that simply predicts all casting parts as non-defective would achieve an accuracy of around 85.7%, yet it would fail entirely to identify defective casting parts, a catastrophic outcome in quality control. To mitigate this, data preprocessing techniques are essential. Broadly, these techniques fall into two categories: undersampling and oversampling. Undersampling methods, such as the Edited Nearest Neighbor (ENN) algorithm, shrink the majority class by removing samples judged uninformative. ENN evaluates the k-nearest neighbors of each majority-class sample; if the sample's label disagrees with most or all of its neighbors, the sample is removed on the grounds that it is noisy or unrepresentative. However, undersampling risks discarding potentially valuable data, especially in datasets where majority samples are not overly redundant. For casting parts production data, where each record encapsulates costly manufacturing insight, undersampling is less desirable due to the potential information loss.
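For illustration only (undersampling was not adopted for the casting parts data), the following is a minimal sketch of ENN using the imbalanced-learn library on a hypothetical two-class toy dataset; the class sizes and features are invented for the example.

```python
# Minimal ENN illustration with imbalanced-learn on a hypothetical toy dataset
# (for comparison only; undersampling was not used for the casting-parts data).
import numpy as np
from imblearn.under_sampling import EditedNearestNeighbours

rng = np.random.default_rng(0)
# 900 "non-defective" (class 0) and 100 "defective" (class 1) synthetic records.
X = np.vstack([rng.normal(0.0, 1.0, size=(900, 8)),
               rng.normal(1.0, 1.0, size=(100, 8))])
y = np.array([0] * 900 + [1] * 100)

# ENN keeps a sample from the resampled class only if its k nearest neighbours
# all share its label; disagreeing (noisy or boundary) samples are removed.
enn = EditedNearestNeighbours(n_neighbors=3)
X_res, y_res = enn.fit_resample(X, y)
print(np.bincount(y), "->", np.bincount(y_res))
```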

In contrast, oversampling methods augment the minority class to balance the dataset. Simple replication of minority samples leads to overfitting, as models memorize repeated instances. The SMOTE algorithm overcomes this by generating synthetic samples that interpolate between existing minority instances. The core idea of SMOTE is to create new samples along the line segments joining a minority sample to its k nearest minority-class neighbors, thereby expanding the feature space of the minority class in a principled manner. For a given minority sample \( \mathbf{x}_i \) in a dataset \( X \), SMOTE first identifies its k-nearest neighbors within the minority class using Euclidean distance. Then, for a randomly selected neighbor \( \mathbf{x}_{zi} \), a synthetic sample \( \mathbf{x}_n \) is generated according to the formula:

$$ \mathbf{x}_n = \mathbf{x}_i + \delta \cdot (\mathbf{x}_{zi} - \mathbf{x}_i) $$

where \( \delta \) is a random number uniformly distributed between 0 and 1. This process is repeated \( n \) times per minority sample, where \( n \) is set by the desired oversampling ratio, effectively increasing the minority class size. The mathematical foundation ensures that synthetic samples for defective casting parts lie within the convex hull of existing minority instances, preserving the underlying data distribution. The algorithm can be formalized as follows: Let \( S_{\text{min}} \) be the set of minority class samples with size \( |S_{\text{min}}| \). For each \( \mathbf{x}_i \in S_{\text{min}} \), compute the set \( N_k(\mathbf{x}_i) \) of k-nearest neighbors within \( S_{\text{min}} \). Then, for \( t = 1 \) to \( n \), randomly select a neighbor \( \mathbf{x}_{zi} \in N_k(\mathbf{x}_i) \) and generate a synthetic sample as above. The generation step costs \( O(|S_{\text{min}}| \cdot n \cdot d) \), where \( d \) is the dimensionality, and a brute-force neighbor search adds at most \( O(|S_{\text{min}}|^2 \cdot d) \), which is efficient enough for preprocessing datasets of casting parts.
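To make the interpolation step concrete, the following is a minimal sketch of the formula above using NumPy and scikit-learn's NearestNeighbors; the function name and parameters are illustrative and not the exact implementation used in this work.

```python
# A sketch of the SMOTE interpolation formula using NumPy and scikit-learn's
# NearestNeighbors (function and parameter names are illustrative).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_oversample(X_min, n_per_sample, k=5, random_state=0):
    """Generate n_per_sample synthetic points for every minority sample in X_min."""
    rng = np.random.default_rng(random_state)
    # k + 1 neighbours because each sample is its own nearest neighbour.
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X_min).kneighbors(X_min)
    synthetic = []
    for i, x_i in enumerate(X_min):
        for _ in range(n_per_sample):
            x_zi = X_min[rng.choice(idx[i][1:])]            # random minority neighbour
            delta = rng.random()                            # delta ~ U(0, 1)
            synthetic.append(x_i + delta * (x_zi - x_i))    # x_n = x_i + delta * (x_zi - x_i)
    return np.asarray(synthetic)

# Example: generate four synthetic samples per original shrinkage-defect record.
X_min = np.random.rand(148, 10)                  # 148 records, 10 normalized features
X_syn = smote_oversample(X_min, n_per_sample=4)  # 592 synthetic samples
```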

Implementing SMOTE for casting parts data involved a Python-based pipeline. Key libraries included NumPy for numerical operations, scikit-learn for machine learning utilities, and custom modules for data handling. The steps were: (1) loading normalized data from Excel files containing process parameters for casting parts; (2) defining a SMOTE class with functions to find k-nearest neighbors and generate synthetic samples; (3) applying SMOTE separately to each defect category to achieve balance with the non-defective class; and (4) exporting the augmented dataset. The effectiveness of SMOTE is visualized in feature subspaces—for example, considering pouring temperature and clay content as attributes, synthetic samples for defective casting parts fill the gaps between original minority points, leading to a more uniform distribution without introducing noise. After preprocessing, the dataset expanded to approximately 30,000 instances, with each class (including non-defective and all defect types) containing around 6,000 samples, as detailed in Table 2. This balanced dataset is crucial for training robust defect prediction models for casting parts.
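The custom SMOTE class itself is not reproduced here; as an approximation of the four pipeline steps above, the sketch below uses the imbalanced-learn library's SMOTE with per-class target counts. The file names, label column, and target counts are illustrative assumptions rather than the actual project files.

```python
# A sketch of the preprocessing pipeline using imbalanced-learn's SMOTE as an
# approximation of the custom implementation; file names, the label column,
# and the per-class target counts below are illustrative assumptions.
import pandas as pd
from imblearn.over_sampling import SMOTE

# (1) Load the normalized process-parameter records for the casting parts.
df = pd.read_excel("casting_process_data.xlsx")                # hypothetical file
X, y = df.drop(columns=["defect_class"]), df["defect_class"]   # hypothetical label column

# (2)-(3) Oversample every defect class toward the size of the non-defective
# class (k_neighbors=5 matches the k used in this work).
targets = {c: 6000 for c in y.unique() if c != "non_defective"}
smote = SMOTE(sampling_strategy=targets, k_neighbors=5, random_state=42)
X_bal, y_bal = smote.fit_resample(X, y)   # pandas in, pandas out (recent imblearn versions)

# (4) Export the augmented, balanced dataset for model training.
pd.concat([X_bal, y_bal], axis=1).to_excel("casting_data_balanced.xlsx", index=False)
```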

Table 2: Data Distribution After SMOTE Preprocessing for Casting Parts
Class | Number of Instances | Percentage (%)
Non-defective Casting Parts | 5,848 | 19.0
Cold Shut Defects | 6,006 | 19.5
Blowhole Defects | 6,384 | 20.7
Sand Inclusion Defects | 6,408 | 20.8
Shrinkage Defects | 6,215 | 20.1
Total | 30,861 | 100.0

To assess the impact of SMOTE on defect prediction for casting parts, I developed a neural network model. The model architecture comprised an input layer matching the feature dimension (e.g., chemical and process parameters), two hidden layers with ReLU activation functions, and an output layer with Softmax activation for multi-class classification. ReLU, defined as \( f(x) = \max(0, x) \), introduces non-linearity and mitigates vanishing gradients, while Softmax converts logits \( z_i \) into probabilities for each defect class:

$$ \text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}} $$

where \( C \) is the number of classes (five in this case). The model was trained using cross-entropy loss, a standard loss function for classification tasks. For a dataset with \( N \) samples, the cross-entropy loss \( L \) is computed as:

$$ L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c}) $$

Here, \( y_{i,c} \) is the binary indicator of whether sample \( i \) belongs to class \( c \), and \( \hat{y}_{i,c} \) is the predicted probability. Accuracy, defined as the ratio of correctly classified casting parts to the total number of samples, served as a complementary metric; in the binary defective-versus-non-defective view it can be written as:

$$ \text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}} \times 100\% $$

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively. Training ran for 5,000 iterations with the Adam optimizer, and performance was evaluated on training and test sets obtained from a random 80/20 split.
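The exact network implementation is not shown in this article; the sketch below approximates the setup described above with scikit-learn's MLPClassifier (two ReLU hidden layers, softmax output, cross-entropy loss, Adam optimizer), reusing the balanced X_bal and y_bal from the earlier preprocessing sketch. The hidden-layer widths are illustrative choices, not the original architecture's values.

```python
# A sketch of the model and training setup described above, approximated with
# scikit-learn's MLPClassifier; hidden-layer widths are illustrative, and
# X_bal / y_bal come from the earlier preprocessing sketch.
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, log_loss

X_train, X_test, y_train, y_test = train_test_split(
    X_bal, y_bal, test_size=0.2, random_state=42, stratify=y_bal)

# Two ReLU hidden layers; for multi-class targets MLPClassifier uses a softmax
# output and optimizes the cross-entropy loss, with Adam as the default solver.
model = MLPClassifier(hidden_layer_sizes=(64, 64), activation="relu",
                      solver="adam", max_iter=5000, random_state=42)
model.fit(X_train, y_train)

print("test accuracy     :", accuracy_score(y_test, model.predict(X_test)))
print("test cross-entropy:", log_loss(y_test, model.predict_proba(X_test)))
```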

Before applying SMOTE, the model exhibited suboptimal behavior. As shown in Table 3, accuracy and cross-entropy loss fluctuated significantly during training, with accuracy plateauing around 86.5% and loss showing occasional increases—indicative of instability and overfitting to the majority class of non-defective casting parts. The curves of loss versus iterations were non-monotonic, suggesting poor convergence due to data imbalance. This aligns with theoretical expectations: when defective casting parts are underrepresented, the model prioritizes minimizing error on the abundant non-defective class, failing to learn discriminative patterns for defects.

Table 3: Model Performance Before SMOTE Preprocessing (Snapshot over Iterations)
Iteration | Training Accuracy | Training Loss | Test Accuracy | Test Loss
0 | 0.3595 | 1.8711 | 0.3724 | 1.8786
500 | 0.8371 | 0.3899 | 0.8457 | 0.3741
1000 | 0.8544 | 0.3587 | 0.8650 | 0.3534
2000 | 0.8565 | 0.3487 | 0.8667 | 0.3454
3000 | 0.8601 | 0.3439 | 0.8633 | 0.3461
4000 | 0.8628 | 0.3353 | 0.8656 | 0.3391
5000 | 0.8639 | 0.3280 | 0.8650 | 0.3383

After SMOTE preprocessing, the model performance improved dramatically. The balanced dataset enabled the neural network to learn effectively from all classes of casting parts. As summarized in Table 4, accuracy increased monotonically to approximately 97.9% on the test set, while cross-entropy loss decreased smoothly toward zero. The training curves became stable and convergent, with no erratic fluctuations. This underscores the value of SMOTE in creating a representative dataset where defective casting parts are adequately sampled, allowing the model to generalize across all defect types. The enhancement in prediction capability is critical for real-world applications, as it translates to fewer missed defects in casting parts production, ultimately reducing scrap rates and improving quality assurance.

Table 4: Model Performance After SMOTE Preprocessing (Snapshot over Iterations)
Iteration | Training Accuracy | Training Loss | Test Accuracy | Test Loss
0 | 0.2081 | 10.9126 | 0.2046 | 10.9672
500 | 0.9474 | 0.2219 | 0.9443 | 0.2255
1000 | 0.9644 | 0.1541 | 0.9620 | 0.1591
2000 | 0.9726 | 0.1114 | 0.9713 | 0.1170
3000 | 0.9757 | 0.0939 | 0.9744 | 0.0988
4000 | 0.9772 | 0.0829 | 0.9762 | 0.0876
5000 | 0.9799 | 0.0753 | 0.9791 | 0.0798

The success of SMOTE in this context can be further analyzed through statistical measures. For instance, the F1-score, which balances precision and recall, improved significantly for the minority defect classes in casting parts. Precision, defined as \( \frac{\text{TP}}{\text{TP} + \text{FP}} \), and recall, defined as \( \frac{\text{TP}}{\text{TP} + \text{FN}} \), are both crucial for defect detection: high recall ensures few defective casting parts are missed, while high precision reduces false alarms. After SMOTE, the macro-averaged F1-score across all classes approached 0.98, indicating robust performance. Additionally, the geometric mean (G-mean) of the per-class sensitivities, given by \( \left( \prod_{c=1}^{C} \text{Sensitivity}_c \right)^{1/C} \), where Sensitivity\(_c\) is the recall for class \( c \), increased from 0.72 to 0.97, reflecting better balance in predicting each defect type for casting parts.
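These class-wise measures can be computed directly with scikit-learn; the short sketch below continues from the training sketch above and uses its fitted model and test split to report the macro-F1, per-class recall, and G-mean.

```python
# Class-wise evaluation (macro-F1, per-class recall, G-mean), continuing from
# the training sketch above and using its fitted model and test split.
import numpy as np
from sklearn.metrics import f1_score, recall_score

y_pred = model.predict(X_test)

macro_f1 = f1_score(y_test, y_pred, average="macro")
sensitivities = recall_score(y_test, y_pred, average=None)        # recall per class
g_mean = float(np.prod(sensitivities)) ** (1.0 / len(sensitivities))  # geometric mean

print(f"macro F1: {macro_f1:.3f}  G-mean: {g_mean:.3f}")
```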

Beyond SMOTE, other techniques could be explored for casting parts data. For example, SMOTE variants such as Borderline-SMOTE focus interpolation on minority samples near decision boundaries, potentially refining where synthetic samples are placed; a brief sketch follows below. However, SMOTE's simplicity and effectiveness make it a suitable starting point. It is also worth noting that the quality of synthetic samples depends on the feature representation; thus, feature engineering for casting parts parameters, such as deriving interaction terms between pouring temperature and chemical composition, could further enhance SMOTE's utility. The k-nearest neighbor parameter \( k \) in SMOTE also warrants tuning: too small a \( k \) yields near-duplicate samples that add little diversity, while too large a \( k \) can interpolate between distant points and blur the class boundary. In my implementation, \( k = 5 \) was used based on cross-validation, ensuring synthetic defective casting parts were realistic.
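For comparison, a minimal sketch of Borderline-SMOTE via imbalanced-learn is shown below, applied to the same hypothetical X and y as the earlier pipeline sketch; the parameters shown are the library defaults, not tuned values from this work.

```python
# Borderline-SMOTE restricts interpolation to minority samples near the class
# boundary; parameters are imbalanced-learn defaults, applied to the same
# hypothetical X and y as the earlier pipeline sketch.
from imblearn.over_sampling import BorderlineSMOTE

bsmote = BorderlineSMOTE(kind="borderline-1", k_neighbors=5, random_state=42)
X_bl, y_bl = bsmote.fit_resample(X, y)
```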

In conclusion, addressing data imbalance is a critical step in leveraging machine learning for defect prediction in casting parts. The SMOTE algorithm proves to be a powerful preprocessing tool, enabling the creation of balanced datasets that foster accurate and stable predictive models. Through this approach, the accuracy of defect prediction for complex casting parts in sand casting can be elevated from around 86.5% to over 97.9%, with corresponding improvements in loss metrics and class-wise sensitivity. This advancement holds significant implications for the casting industry, where early detection of defects in casting parts can reduce waste, lower costs, and enhance product reliability. Future work may involve integrating SMOTE with deep learning architectures for real-time monitoring or extending it to other manufacturing domains. Ultimately, by harnessing data preprocessing techniques like SMOTE, we can move closer to intelligent, data-driven quality control systems that ensure the production of flawless casting parts.
