Enhancing Defect Prediction in Complex Sand Castings through SMOTE-Based Data Preprocessing

In the manufacturing sector, sand casting remains a cornerstone process for producing metal components, especially those with intricate geometries. However, the production of complex castings is often plagued by various sand casting defects, such as cold shuts, blowholes, sand inclusions, and shrinkage porosity. These defects compromise the structural integrity and performance of critical parts, leading to significant economic losses. The advent of data-driven artificial intelligence methods offers a promising avenue for predicting these sand casting defects by analyzing historical production data, thereby enabling proactive process adjustments. While foundries generate vast amounts of data, a critical challenge emerges: the dataset is inherently imbalanced. Records for defect-free castings vastly outnumber those for defective ones, and within the defective category, data for specific flaw types like shrinkage are exceedingly rare. Training predictive models on such imbalanced data yields biased and inaccurate classifiers, as models tend to simply predict the majority class (defect-free) to achieve deceptively high accuracy. This work addresses this pivotal issue by investigating the application of the Synthetic Minority Oversampling Technique (SMOTE) algorithm to preprocess imbalanced foundry data, thereby constructing a robust framework for the accurate prediction of sand casting defects.

The foundation of any data-driven model is the quality and structure of its training data. In this study, the data pertains to the production of complex steering bridge castings for forklift trucks, a critical safety component made from ductile iron (Grade QT450-10). The initial dataset, collected directly from the manufacturing floor, contained records of key process parameters believed to influence casting quality. These parameters were categorized into several groups, as summarized below:

| Parameter Category | Specific Parameters (Examples) |
| --- | --- |
| Chemical Composition | C, Si, Mn, Mg, S, P, Al content |
| Molding Sand Properties | Green compression strength, compactability, bentonite content, new sand addition, return sand temperature and moisture |
| Pouring Parameters | Pouring temperature, pouring time, inoculant rate, pouring weight |
| Target / Defect Label | Defect type: None, Cold Shut, Blowhole, Sand Inclusion, Shrinkage |

An initial cleansing step removed records irrelevant to quality prediction. The resulting dataset starkly revealed the imbalance problem. The distribution was as follows: 5,848 defect-free records, 274 for cold shuts, 399 for blowholes, 359 for sand inclusions, and a mere 148 for shrinkage porosity. This extreme disproportion makes it statistically difficult for a learning algorithm to discern the patterns leading to rare sand casting defects.

Formally, let $D = \{(x_i, y_i)\}_{i=1}^{N}$ represent the training dataset, where $x_i$ is the feature vector (process parameters) and $y_i \in \{C_1, C_2, \ldots, C_M\}$ is the class label. The dataset is imbalanced if the number of samples for one or more minority classes $C_{min}$ is significantly less than that of the majority class $C_{maj}$: $|C_{min}| \ll |C_{maj}|$. In our case, $C_{maj}$ is “defect-free,” and $C_{min}$ represents the four defect types. A classifier $f(x)$ trained on $D$ will typically maximize overall accuracy by favoring $C_{maj}$, failing to generalize for predicting $C_{min}$. To illustrate, a naive model that always predicts “defect-free” would achieve an accuracy of:
$$ \text{Naive Accuracy} = \frac{|C_{maj}|}{N} \times 100\% \approx \frac{5848}{7028} \times 100\% \approx 83.2\% $$
This figure is misleading: such a model never flags a single defective casting, so it is useless for practical defect detection.
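The baseline arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch that uses only the class counts reported in the text; the helper names `naive_accuracy` and `imbalance_ratios` are illustrative and not part of the original study.

```python
# Class counts reported in the text after the initial cleansing step.
class_counts = {
    "None": 5848,
    "Cold Shut": 274,
    "Blowhole": 399,
    "Sand Inclusion": 359,
    "Shrinkage": 148,
}


def naive_accuracy(counts):
    """Accuracy of a classifier that always predicts the most frequent class."""
    return max(counts.values()) / sum(counts.values())


def imbalance_ratios(counts):
    """Ratio of the majority class size to each class size."""
    majority = max(counts.values())
    return {label: majority / n for label, n in counts.items()}


total = sum(class_counts.values())
baseline = naive_accuracy(class_counts)
ratios = imbalance_ratios(class_counts)

print(f"N = {total}")
print(f"naive accuracy = {baseline:.1%}")
for label, ratio in ratios.items():
    print(f"{label:15s} majority/class ratio = {ratio:.1f}")
```

The ratios make the imbalance concrete: the rarer a defect class, the larger its ratio, and the less a high overall accuracy says about how well that class is predicted.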

Two primary approaches exist to handle class imbalance: undersampling the majority class and oversampling the minority class. Undersampling methods, such as the Edited Nearest Neighbor (ENN) algorithm, remove samples from $C_{maj}$. While this balances the classes, it risks discarding potentially valuable information, which is untenable when production data is already scarce. Oversampling, conversely, increases the number of samples in $C_{min}$. Simply duplicating existing minority records, however, tends to cause overfitting, because the model sees the same points repeatedly. Therefore, the Synthetic Minority Oversampling Technique (SMOTE) was selected for its ability to generate *new*, plausible minority class samples. The core idea of SMOTE is to interpolate between existing minority samples. For a given minority sample $x_i$, its $k$ nearest neighbors within the minority class are identified. New synthetic samples $x_{new}$ are then created along the line segments joining $x_i$ to its neighbors. The generation formula for a synthetic sample between $x_i$ and a randomly chosen neighbor $x_{zi}$ is:
$$ x_{new} = x_i + \delta \cdot (x_{zi} - x_i) $$
where $\delta$ is a random number between 0 and 1. This effectively “populates” the feature space region between known defect cases, allowing the decision boundary to be more precisely learned.
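The interpolation rule can be sketched in plain Python as follows. This is an illustrative implementation, not the code used in the study: the function names, the toy feature vectors, the default `k=5`, and the choice to pick the base sample at random (rather than looping over every minority sample a fixed number of times) are all assumptions made for brevity.

```python
import math
import random


def k_nearest(index, samples, k):
    """Indices of the k samples closest (Euclidean) to samples[index], excluding itself."""
    distances = [
        (math.dist(samples[index], other), j)
        for j, other in enumerate(samples)
        if j != index
    ]
    distances.sort()
    return [j for _, j in distances[:k]]


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbors than other samples
    neighbors = [k_nearest(i, minority, k) for i in range(len(minority))]
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))   # base sample x_i
        z = rng.choice(neighbors[i])       # randomly chosen neighbor x_zi
        delta = rng.random()               # random number in [0, 1)
        x_i, x_z = minority[i], minority[z]
        # x_new = x_i + delta * (x_zi - x_i), applied feature by feature
        synthetic.append([a + delta * (b - a) for a, b in zip(x_i, x_z)])
    return synthetic


# Toy, made-up normalized feature vectors for a single defect class.
minority = [[0.10, 0.20], [0.15, 0.25], [0.20, 0.10], [0.30, 0.30]]
new_points = smote(minority, n_new=6, k=2)
for point in new_points:
    print([round(v, 3) for v in point])
```

Because every synthetic point lies on a segment between two real minority samples, the generated data never leaves the region already spanned by that class.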

The SMOTE algorithm was implemented in Python to preprocess the steering bridge casting data. The process involved loading the normalized data, calculating the k-nearest neighbors for each defect sample, and generating synthetic samples for each minority class until its size was comparable to that of the majority class. The sampling strategy was chosen so that all five classes ended up with roughly equal counts. The result was a substantial expansion of the dataset, as shown in the comparison table below.

| Class Label | Original Sample Count | Sample Count After SMOTE |
| --- | --- | --- |
| Defect-Free (None) | 5,848 | 5,848 (unchanged) |
| Cold Shut | 274 | ~6,006 |
| Blowhole | 399 | ~6,384 |
| Sand Inclusion | 359 | ~6,408 |
| Shrinkage Porosity | 148 | ~6,215 |
| Total | 7,028 | ~30,861 |
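To make the scale of the augmentation explicit, the short sketch below derives the number of synthetic records per class from the table above. The counts are copied from the table and are approximate, as the source marks them with "~"; the variable names are illustrative.

```python
# Counts copied from the table above; post-SMOTE values are approximate.
original = {
    "None": 5848, "Cold Shut": 274, "Blowhole": 399,
    "Sand Inclusion": 359, "Shrinkage": 148,
}
after_smote = {
    "None": 5848, "Cold Shut": 6006, "Blowhole": 6384,
    "Sand Inclusion": 6408, "Shrinkage": 6215,
}

# Synthetic records added per class and the growth factor of each class.
synthetic = {label: after_smote[label] - original[label] for label in original}
growth = {label: after_smote[label] / original[label] for label in original}

total_after = sum(after_smote.values())
synthetic_share = sum(synthetic.values()) / total_after

for label in original:
    print(f"{label:15s} +{synthetic[label]:5d} synthetic  (x{growth[label]:.1f})")
print(f"total after SMOTE = {total_after}")
print(f"share of synthetic records = {synthetic_share:.1%}")
```

The rarest class grows by the largest factor, so its synthetic records are interpolated from the smallest pool of real examples.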

To rigorously evaluate the impact of this preprocessing step, a neural network-based classifier was developed to predict the occurrence of sand casting defects. The model architecture included hidden layers with Rectified Linear Unit (ReLU) activation functions and an output layer with a Softmax activation function for multi-class classification. The ReLU function is defined as:
$$ f(x) = \max(0, x) $$
The Softmax function for class $i$ is given by:
$$ \text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}} $$
where $z_i$ is the input to the output neuron for class $i$, and $C$ is the total number of classes (5 in our case). The model was trained to minimize the categorical cross-entropy loss $L$, which for a single sample is:
$$ L_i = -\sum_{c=1}^{C} y_{i,c} \cdot \log(\hat{y}_{i,c}) $$
where $y_{i,c}$ is the true label (1 if sample $i$ belongs to class $c$, else 0) and $\hat{y}_{i,c}$ is the predicted probability. The primary evaluation metric was accuracy, but the behavior of the loss function during training provided critical insights into model stability and learning efficacy.
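The three formulas above translate directly into code. The following is a minimal standard-library sketch; the logits are made-up numbers, and the max-subtraction inside `softmax` and the `eps` clamp inside `cross_entropy` are common numerical-stability conventions assumed here, not details taken from the study.

```python
import math


def relu(x):
    """Rectified Linear Unit: f(x) = max(0, x)."""
    return max(0.0, x)


def softmax(z):
    """Softmax over a list of logits; subtracting max(z) leaves the result unchanged."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(y_true, y_pred, eps=1e-12):
    """Categorical cross-entropy for one sample with a one-hot label."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


# Made-up output-layer inputs for the 5 classes
# (None, Cold Shut, Blowhole, Sand Inclusion, Shrinkage).
logits = [2.0, 0.5, -1.0, 0.0, 0.1]
probs = softmax(logits)
y_true = [1, 0, 0, 0, 0]  # one-hot label: the sample is defect-free
loss = cross_entropy(y_true, probs)

print([round(p, 3) for p in probs])
print(round(loss, 3))
```

With a one-hot label, the sum collapses to a single term, so the loss is simply the negative log of the probability assigned to the true class.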

The performance contrast between models trained on the raw imbalanced data and the SMOTE-balanced data was profound. When trained on the imbalanced dataset, the model’s learning process was erratic. The cross-entropy loss did not decrease monotonically; it exhibited significant fluctuations and occasional increases during training iterations. Concurrently, the accuracy metric oscillated violently, particularly between certain epochs. This instability is a classic symptom of learning from imbalanced data, where the gradient updates are dominated by the majority class, preventing effective learning of the minority class discriminants. The final test accuracy plateaued at approximately 86.5%, which, as argued, is not a reliable indicator of true predictive power for sand casting defects.

In stark contrast, the model trained on the SMOTE-processed balanced dataset demonstrated stable and robust learning behavior. The cross-entropy loss decreased smoothly and monotonically towards zero across thousands of iterations. The accuracy increased consistently without any destabilizing oscillations. This indicates that the model was effectively learning discriminative features for all classes, including the previously rare sand casting defects. The final performance metrics showed a substantial improvement, with training accuracy reaching ~97.99% and, more importantly, test accuracy achieving ~97.91%. The detailed progression of key metrics during training on the balanced data is illustrated in the following table, highlighting the stable convergence.

| Iteration | Training Accuracy | Training Loss (Cross-Entropy) | Test Accuracy | Test Loss (Cross-Entropy) |
| --- | --- | --- | --- | --- |
| 0 | 0.208 | 10.913 | 0.205 | 10.967 |
| 500 | 0.947 | 0.222 | 0.944 | 0.225 |
| 1000 | 0.964 | 0.154 | 0.962 | 0.159 |
| 2000 | 0.973 | 0.111 | 0.971 | 0.117 |
| 3000 | 0.976 | 0.094 | 0.974 | 0.099 |
| 5000 | 0.980 | 0.075 | 0.979 | 0.080 |
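The convergence claims can be checked against the tabulated checkpoints with a short script. This sketch only inspects the six rows listed above, so it says nothing about behavior between checkpoints; the helper name `is_strictly_decreasing` is illustrative.

```python
# (iteration, train_acc, train_loss, test_acc, test_loss) copied from the table above.
history = [
    (0, 0.208, 10.913, 0.205, 10.967),
    (500, 0.947, 0.222, 0.944, 0.225),
    (1000, 0.964, 0.154, 0.962, 0.159),
    (2000, 0.973, 0.111, 0.971, 0.117),
    (3000, 0.976, 0.094, 0.974, 0.099),
    (5000, 0.980, 0.075, 0.979, 0.080),
]


def is_strictly_decreasing(values):
    """True if every value is smaller than the one before it."""
    return all(later < earlier for earlier, later in zip(values, values[1:]))


train_loss = [row[2] for row in history]
test_loss = [row[4] for row in history]
# Gap between training and test accuracy at each checkpoint.
gaps = [round(row[1] - row[3], 3) for row in history]

print("train loss decreasing:", is_strictly_decreasing(train_loss))
print("test loss decreasing:", is_strictly_decreasing(test_loss))
print("train/test accuracy gaps:", gaps)
```

A small and shrinking gap between training and test accuracy at these checkpoints is consistent with the stable convergence described in the text.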

This investigation demonstrates the importance of data preprocessing, specifically addressing class imbalance, in developing reliable AI models for industrial quality prediction. The severe imbalance inherent in real-world production data for complex castings creates a fundamental obstacle to accurate defect prediction. By implementing the SMOTE algorithm, we augmented the minority classes representing the various sand casting defects. This transformation of the dataset from an imbalanced to a balanced state was the key enabling factor. The neural network model subsequently trained on the balanced data learned stable and generalizable patterns, and test accuracy rose from 86.50% to 97.91%. This represents a substantial improvement in model reliability. Therefore, the integration of data preprocessing techniques like SMOTE is not merely an optional step but a necessary prerequisite for deploying effective, data-driven quality prediction and control systems in sand foundries. It enables accurate forecasting of sand casting defects, which in turn supports proactive corrections in the manufacturing process, reduces scrap rates, enhances product quality, and delivers substantial economic benefits for casting operations.