The pursuit of zero-defect manufacturing remains a paramount goal within the foundry industry, particularly for complex components produced via sand casting. Sand casting defects, such as cold shuts, blowholes, sand inclusions, and shrinkage porosity, are not merely cosmetic flaws but critical failures that can compromise the structural integrity and functional performance of safety-critical parts. The emergence of data-driven artificial intelligence methods offers a transformative pathway for proactive quality control. By analyzing historical production data encompassing alloy chemistry, pouring parameters, and sand system properties, predictive models can potentially forecast the occurrence of specific defects, allowing for preemptive adjustments to the process. However, the practical application of these sophisticated algorithms in real-world sand casting defect prediction scenarios is severely hampered by a fundamental data challenge: extreme class imbalance.
In a typical production environment, the vast majority of castings are produced without significant flaws. Consequently, datasets collected from the shop floor contain thousands of records for sound castings, while examples for each specific defect category number only in the hundreds or even fewer. This skewed distribution presents a major obstacle for machine learning classifiers. A model trained on such imbalanced data can achieve deceptively high accuracy by simply learning to predict the majority class (sound castings) for every input, completely failing to identify the rare but critical defective cases. This renders the model useless for its intended purpose of sand casting defect detection and prevention.

To illustrate the severity of this problem, consider a dataset from the production of a complex steering bridge component, a critical safety part in material handling equipment. The initial, preprocessed dataset composition is presented below:
| Class | Number of Samples (Raw) |
|---|---|
| Sound Casting | 5,848 |
| Cold Shut Defect | 274 |
| Blowhole Defect | 399 |
| Sand Inclusion Defect | 359 |
| Shrinkage Porosity Defect | 148 |
The imbalance ratio between the majority class (sound) and the smallest minority class (shrinkage) is approximately 40:1. Training a model on this data without correction would inevitably lead to biased learning. This paper details our comprehensive investigation into resolving this data imbalance to build a reliable, high-accuracy predictive model for sand casting defects. We focus on the application and efficacy of the Synthetic Minority Over-sampling Technique (SMOTE) as a preprocessing algorithm, demonstrating its critical role in enabling practical AI-driven quality assurance for sand casting.
The core challenge of learning from imbalanced data has spurred significant research across various domains. In medical diagnostics, hybrid methods combining SMOTE with clustering algorithms like K-means have been proposed to improve classification precision. Studies have also focused on stabilizing the sometimes volatile performance of SMOTE-based techniques and developing deep learning approaches like Generative Adversarial Networks (GANs) for feature-enhanced data synthesis in mechanical fault diagnosis. Other strategies include cost-sensitive learning frameworks and adaptive over-sampling methods that tailor data generation to local distribution characteristics. These works collectively underscore that the choice and implementation of a data-level balancing strategy are not ancillary but central to the success of any predictive modeling task on imbalanced datasets, including sand casting defect analysis.
The Data Imbalance Problem and Conventional Solutions
Formally, a dataset is considered imbalanced when the class distributions are highly non-uniform. Most standard classification algorithms, including neural networks, support vector machines, and decision trees, operate under an implicit assumption of relatively balanced class frequencies. They aim to minimize the overall error rate, which, when faced with severe imbalance, is most efficiently reduced by favoring the majority class. The performance metrics become misleading; a model that classifies every steering bridge casting as “sound” would achieve an accuracy of:
$$ \text{Accuracy} = \frac{5848}{5848 + 274 + 399 + 359 + 148} = \frac{5848}{7028} \approx 83.2\% $$
While this accuracy appears high, the model’s recall (or true positive rate) for any defect class would be 0%, making it entirely ineffective for sand casting defect prediction. Therefore, specialized techniques are required to mitigate this bias. These techniques generally fall into two categories: algorithm-level (e.g., cost-sensitive learning) and data-level approaches. Our work focuses on data-level methods, which preprocess the dataset to create a more balanced distribution before model training.
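This failure mode is easy to demonstrate with the class counts from the table above. The sketch below (plain Python, illustrative only) scores the trivial always-"sound" predictor:

```python
# Class counts taken from the table above: a trivial classifier that always
# predicts "sound" scores high on accuracy but has zero recall for every
# defect class.
counts = {
    "sound": 5848,
    "cold_shut": 274,
    "blowhole": 399,
    "sand_inclusion": 359,
    "shrinkage": 148,
}

total = sum(counts.values())                      # 7028 records
majority_accuracy = counts["sound"] / total
print(f"Always-'sound' accuracy: {majority_accuracy:.1%}")  # 83.2%

# Recall for any defect class: true positives / actual positives = 0 / n
defect_recall = {c: 0 / n for c, n in counts.items() if c != "sound"}
print(defect_recall)  # every defect class: 0.0
```

The accuracy looks respectable, but the recall dictionary exposes the model as useless for defect detection.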
Undersampling methods aim to balance the dataset by reducing the number of instances in the majority class. Random undersampling discards majority class examples arbitrarily, which risks removing potentially informative samples and can lead to loss of generality. More sophisticated methods like Edited Nearest Neighbors (ENN) attempt to selectively remove only those majority samples that are considered redundant or noisy. ENN operates by examining the k-nearest neighbors of each majority instance. If most (Mode strategy) or all (All strategy) of these neighbors belong to the majority class, the instance is deemed redundant and removed. The logic is that its informational content is already represented by its neighbors. However, in practical industrial datasets, such perfectly redundant points are scarce. Therefore, undersampling often cannot remove enough majority samples to achieve balance without jeopardizing the dataset’s integrity and representativeness of the normal process window.
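As a concrete illustration, the ENN "All" strategy can be sketched in a few lines of NumPy. This is a brute-force version for clarity; the function name and interface are ours, and a production pipeline would more likely use an established library such as imbalanced-learn:

```python
import numpy as np

def enn_all(X, y, majority_label, k=3):
    """Edited Nearest Neighbors, 'All' strategy: drop a majority-class
    sample when *all* of its k nearest neighbors are also majority-class,
    i.e. the point is considered redundant."""
    keep = np.ones(len(X), dtype=bool)
    for i in np.where(y == majority_label)[0]:
        # Euclidean distance from sample i to every other sample
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                      # a point is not its own neighbor
        neighbors = np.argsort(d)[:k]      # indices of the k nearest samples
        if np.all(y[neighbors] == majority_label):
            keep[i] = False                # redundant: remove it
    return X[keep], y[keep]
```

Applied to a tight majority cluster, this removes the interior points whose neighborhoods contain no minority samples, while majority points near the class boundary survive, which is exactly why ENN alone rarely removes enough samples to balance a severely skewed industrial dataset.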
Oversampling methods, conversely, increase the number of instances in the minority class. The simplest form is random replication, where existing minority samples are duplicated. This, however, provides no new information to the model and readily leads to severe overfitting, as the model learns to recognize specific data points rather than general patterns. The model becomes exceptionally good at classifying the repeated samples but fails to generalize to unseen, albeit similar, sand casting defect patterns. This fundamental limitation of naive oversampling necessitates the use of synthetic data generation techniques that can create new, plausible examples of the minority class.
The SMOTE Algorithm: Theory and Implementation for Sand Casting Data
The Synthetic Minority Over-sampling Technique (SMOTE) was introduced to overcome the limitations of simple replication. Its core principle is to generate new, synthetic minority class examples that are plausible interpolations between existing, real minority class examples. This approach helps the learner to broaden its conception of the minority class region in the feature space, rather than memorizing specific points. The algorithm operates in the feature space defined by the process parameters, such as carbon content, pouring temperature, sand compactibility, and moisture.
The SMOTE generation process for a given minority class is as follows:
- For each instance \( x_i \) in the minority class, compute its k-nearest neighbors within the same minority class using a distance metric (typically Euclidean distance).
- Based on the desired over-sampling ratio \( N \), for each \( x_i \), randomly select \( N \) of its k-nearest neighbors. Let one selected neighbor be \( x_{zi} \).
- For each selected pair \( (x_i, x_{zi}) \), create a new synthetic sample \( x_{new} \) using linear interpolation:
$$ x_{new} = x_i + \delta \cdot (x_{zi} - x_i) $$
where \( \delta \) is a random number between 0 and 1. This formula constructs a point along the line segment connecting \( x_i \) and \( x_{zi} \) in the feature space.
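The three steps above can be condensed into a short NumPy sketch. This uses brute-force neighbor search for clarity; the function name and signature are illustrative, not the paper's actual implementation:

```python
import numpy as np

def smote(X_min, n_synthetic, k=5, rng=None):
    """Generate n_synthetic samples for one minority class by linear
    interpolation between each chosen sample and one of its k nearest
    minority-class neighbors: x_new = x_i + delta * (x_zi - x_i)."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # Pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)            # a point is not its own neighbor
    knn = np.argsort(d, axis=1)[:, :k]     # k nearest neighbors per sample
    synthetic = np.empty((n_synthetic, X_min.shape[1]))
    for j in range(n_synthetic):
        i = rng.integers(n)                       # pick a real minority sample
        zi = knn[i, rng.integers(min(k, n - 1))]  # pick one of its neighbors
        delta = rng.random()                      # interpolation factor in [0, 1)
        synthetic[j] = X_min[i] + delta * (X_min[zi] - X_min[i])
    return synthetic
```

Because every synthetic point is a convex combination of two real minority samples, the generated data stays inside the region spanned by the observed defects; raising the shrinkage class from 148 samples toward roughly 6,215 would, for instance, request 6215 - 148 = 6067 synthetic points.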
The visual concept is that for a given minority sample, SMOTE identifies its neighborhood and populates that region with new, synthetic data points. This effectively “fills in the gaps” in the feature space for the minority class, encouraging the decision boundary to be more general and less skewed towards the majority class. For sand casting defect prediction, this means creating new, virtual records of defect occurrences that share characteristics of actual defective castings, thereby teaching the model the nuanced combinations of process parameters that lead to cold shuts, blowholes, etc.
In our implementation for the steering bridge casting data, we selected SMOTE over undersampling due to the inherent value and limited quantity of every collected production record. Discarding thousands of sound casting records was deemed unacceptable. The algorithm was implemented in Python, utilizing key libraries for data handling, numerical operations, and machine learning utilities. The process followed a structured pipeline: data loading and normalization, construction of a custom SMOTE class encapsulating the neighbor-finding and sample-synthesis logic, application of the algorithm to each defect class separately, and finally, the compilation and export of a fully balanced, synthetic-augmented dataset ready for model training.
Impact of SMOTE Preprocessing on Dataset Composition
The application of the SMOTE algorithm fundamentally transformed the dataset available for training the sand casting defect prediction model. By synthetically oversampling each of the four defect classes (cold shut, blowhole, sand inclusion, shrinkage), we elevated their sample counts to a level comparable with the majority ‘sound’ class. The specific results of this data preprocessing step are quantified in the table below.
| Class | Sample Count (Raw Data) | Sample Count (After SMOTE) | Oversampling Factor |
|---|---|---|---|
| Sound Casting | 5,848 | 5,848 (No change) | 1.00x |
| Cold Shut Defect | 274 | ~6,006 | ~21.92x |
| Blowhole Defect | 399 | ~6,384 | ~16.00x |
| Sand Inclusion Defect | 359 | ~6,408 | ~17.85x |
| Shrinkage Porosity Defect | 148 | ~6,215 | ~42.00x |
| Total | 7,028 | ~30,861 | — |
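The oversampling factors and totals in the table follow directly from the class counts; a quick arithmetic check in Python:

```python
# Arithmetic check of the table above: each oversampling factor is simply
# (count after SMOTE) / (raw count), and the totals are the column sums.
raw   = {"sound": 5848, "cold_shut": 274, "blowhole": 399,
         "sand_inclusion": 359, "shrinkage": 148}
after = {"sound": 5848, "cold_shut": 6006, "blowhole": 6384,
         "sand_inclusion": 6408, "shrinkage": 6215}

factors = {c: after[c] / raw[c] for c in raw}
print(factors)              # e.g. shrinkage: 6215 / 148 ≈ 42.0
print(sum(raw.values()))    # 7028
print(sum(after.values()))  # 30861
```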
The effect of this transformation can be conceptualized in a reduced feature subspace. Consider a 2D projection using two critical parameters: pouring temperature and clay content in the sand mixture. In the raw data scatter plot, the minority defect classes would appear as tiny, isolated clusters overwhelmed by a dense cloud of sound casting points. After applying SMOTE, the defect clusters expand into denser, more defined regions, creating a feature space where all classes have substantial representation. This balanced landscape is essential for training a classifier that can delineate the complex, often non-linear boundaries separating sound castings from various sand casting defect types.
Evaluation Framework for Defect Prediction Models
To rigorously assess the impact of data preprocessing, we constructed a neural network-based classifier for multi-class defect prediction. The network architecture included multiple fully connected hidden layers utilizing the Rectified Linear Unit (ReLU) activation function, defined as \( f(x) = \max(0, x) \). ReLU helps mitigate the vanishing gradient problem and allows for efficient training of deep networks. The output layer used the Softmax activation function to produce a probability distribution across the five possible classes (Sound, Cold Shut, Blowhole, Sand Inclusion, Shrinkage). For a vector of output logits \( z = [z_1, z_2, \ldots, z_C] \) with \( C = 5 \), the Softmax probability for class \( i \) is:
$$ \text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}} $$
This ensures the model’s outputs are normalized probabilities that sum to 1. The model was trained to minimize the Categorical Cross-Entropy loss, which measures the discrepancy between the predicted probability distribution \( p \) and the true one-hot encoded label \( y \):
$$ L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log(p_{i,c}) $$
where \( N \) is the batch size, \( C \) is the number of classes, \( y_{i,c} \) is 1 if sample \( i \) belongs to class \( c \) else 0, and \( p_{i,c} \) is the predicted probability for that class. The primary metric for final model evaluation was accuracy, calculated as:
$$ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} \times 100\% $$
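The Softmax, cross-entropy, and accuracy definitions above translate directly into NumPy. The sketch below uses illustrative logits for two hypothetical castings, not actual model outputs:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis (C = 5 classes here)."""
    z = z - z.max(axis=-1, keepdims=True)   # shift logits for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(p, y_onehot):
    """Mean categorical cross-entropy over a batch of N samples."""
    return -np.mean(np.sum(y_onehot * np.log(p + 1e-12), axis=1))

# Two example castings: logits for (sound, cold shut, blowhole,
# sand inclusion, shrinkage) -- values are illustrative only.
logits = np.array([[4.0, 0.5, 0.2, 0.1, 0.0],
                   [0.3, 0.1, 3.5, 0.2, 0.4]])
p = softmax(logits)
y = np.eye(5)[[0, 2]]                     # true labels: sound, blowhole

loss = cross_entropy(p, y)
accuracy = np.mean(p.argmax(axis=1) == y.argmax(axis=1))
```

Each row of `p` sums to 1, and the accuracy is simply the fraction of rows whose highest-probability class matches the true label.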
However, during training and validation, we meticulously monitored both the loss and accuracy curves on separate training and testing splits to diagnose learning behavior and prevent overfitting.
Comparative Analysis: Model Performance Before and After SMOTE
The detrimental effect of the raw, imbalanced data on model training is starkly evident. When training the neural network on the original dataset, the optimization process was unstable and ineffective for learning the minority class distinctions. The cross-entropy loss curve exhibited non-monotonic behavior, with significant spikes and fluctuations during training iterations rather than a smooth, consistent descent. Concurrently, the accuracy metric showed volatile oscillations, particularly in the early-to-mid stages of training. This instability directly reflects the model’s struggle to reconcile the overwhelming gradient signals from the majority class with the sparse, conflicting signals from the minority defect classes. The final model converged to a state that primarily recognized sound castings, resulting in a test accuracy stuck around 86.5%, barely above the majority-class baseline; as previously established, such a figure is misleading and functionally useless for sand casting defect prediction, since it corresponds to near-zero defect detection rates.
In dramatic contrast, the model trained on the SMOTE-balanced dataset demonstrated stable, robust, and high-performance learning. The cross-entropy loss for both training and validation sets decreased smoothly and monotonically across thousands of iterations, converging towards a very low value. This indicates that the model’s predicted probabilities were becoming increasingly aligned with the true labels across all classes. The accuracy curves followed a complementary smooth, monotonic increase, saturating at a high plateau. No erratic oscillations or instability were observed. The final evaluation of this model yielded a profound improvement in predictive capability.
| Training Condition | Final Training Accuracy | Final Testing Accuracy | Training Behavior |
|---|---|---|---|
| Raw Imbalanced Data | ~86.4% | ~86.5% | Unstable, fluctuating loss/accuracy. |
| SMOTE-Balanced Data | ~97.99% | ~97.91% | Stable, smooth convergence. |
The performance improvement of over 11 percentage points in accuracy is substantial. More importantly, this high accuracy now meaningfully represents correct classifications across all five classes. A detailed confusion matrix analysis (not shown here for brevity) confirmed that the model trained on balanced data achieved high recall and precision for each individual defect type—cold shuts, blowholes, sand inclusions, and shrinkage porosity—rather than just the ‘sound’ class. This is the critical outcome that enables practical application: the model can now reliably flag production runs with parameter sets that are likely to lead to specific sand casting defects.
Discussion and Implications for Foundry Practice
The successful application of the SMOTE algorithm to sand casting production data underscores a vital principle in industrial AI: data quality and preparation are often more decisive than the choice of the modeling algorithm itself. By directly addressing the fundamental issue of class imbalance, SMOTE preprocessing unlocked the latent potential of a standard neural network to perform high-accuracy, multi-class defect prediction. This approach provides a practical and powerful tool for foundries aiming to implement predictive quality systems.
The synthetic samples generated by SMOTE are not mere “fake data”; they represent plausible scenarios within the process parameter space that could lead to defects. They effectively teach the model the boundaries and contours of failure modes. For the complex steering bridge casting, this means the model learned the subtle, multi-dimensional interactions between, for example, low pouring temperature, high sand moisture, and a specific magnesium content that collectively predispose the casting to a cold shut defect. This nuanced understanding is what allows for accurate sand casting defect anticipation.
It is important to acknowledge the limitations and considerations for future work. SMOTE generates samples in the feature space without direct reference to the underlying physical metallurgical or thermo-fluid dynamics. Extremely sparse or isolated minority samples (outliers) can lead to the generation of noisy or unrealistic synthetic examples. Future research could integrate physics-based constraints or combine SMOTE with cleaning techniques like Tomek links. Furthermore, the exploration of advanced deep generative models, such as Variational Autoencoders (VAEs) or Conditional GANs, for synthesizing more complex and high-fidelity defect data represents a promising frontier. The integration of cost-sensitive learning at the algorithm level alongside data-level balancing could also further optimize the model for specific operational priorities, such as minimizing the missed detection of the most critical sand casting defect.
Conclusion
This investigation conclusively demonstrates that data imbalance is a primary impediment to developing effective data-driven models for predicting defects in sand casting. The Synthetic Minority Over-sampling Technique (SMOTE) serves as a highly effective preprocessing solution to this challenge. By scientifically augmenting the scarce defect data through intelligent synthetic generation, SMOTE transforms an imbalanced, unusable dataset into a balanced, information-rich foundation for model training. In our case study on a complex steering bridge casting, this transformation elevated the defect prediction model’s accuracy from a misleading 86.5% to a robust and meaningful 97.9%, enabling reliable identification of cold shuts, blowholes, sand inclusions, and shrinkage porosity. The methodology outlined provides a clear, implementable framework for foundries to leverage their existing production data, overcome the inherent imbalance of defect occurrences, and build powerful AI tools for proactive quality control and the reduction of costly sand casting defects.
