The Data Preprocessing Revolution in Sand Casting Defect Prediction

In the landscape of modern manufacturing, the imperative for quality is paramount. Within the foundry sector, sand casting remains a cornerstone process for producing complex, high-integrity components. However, the journey from molten metal to a flawless final product is fraught with potential pitfalls. Defects such as cold shuts, blowholes, sand inclusions, and shrinkage cavities can compromise the structural integrity and performance of critical parts, leading to significant economic losses and safety concerns. Predicting the occurrence of these sand casting defects using data-driven models presents a powerful avenue for proactive quality control. Yet, a fundamental and pervasive challenge obstructs the path to reliable prediction: the severe imbalance inherent in real-world production data. This article delves into this critical issue, exploring the transformative application of the Synthetic Minority Over-sampling Technique (SMOTE) as a preprocessing solution to unlock the true potential of machine learning in sand casting defect prediction.

The core of the data-driven approach lies in learning patterns from historical production records. These records typically encompass a multitude of process parameters: alloy chemical composition (e.g., C, Si, Mn content), pouring parameters (temperature, time), and sand treatment parameters (compactability, clay content, moisture). A supervised learning model aims to find the complex, non-linear relationships between these input features and the output—the presence or type of a sand casting defect. The ultimate goal is to create a digital twin of the process that can forecast quality issues before they manifest physically.

However, the reality of manufacturing data starkly contrasts with the ideal datasets often used in academic machine learning. In a typical production run for a high-value component, the vast majority of castings are produced without critical defects. Consequently, data collection yields an abundance of records labeled as “non-defective” and a paucity of records for each specific defect class. To illustrate this imbalance, consider a hypothetical but realistic dataset derived from the production of a complex steering axle casting:

Class Label	Number of Samples
Non-Defective	5,848
Blowhole Defect	399
Cold Shut Defect	274
Sand Inclusion Defect	359
Shrinkage Cavity Defect	148

This skewed distribution is not merely a statistical curiosity; it is a fundamental barrier to model performance. Most standard classification algorithms operate under an implicit assumption of relatively balanced class distributions. When faced with extreme imbalance, these models become biased toward the majority class. A trivial model that simply predicts “non-defective” for every input would achieve an accuracy of about 83% on the above data (5,848 of 7,028 samples), yet it would be utterly useless for the practical task of identifying impending sand casting defects. The model’s learning process is dominated by the patterns of the majority class, failing to discern the subtle signatures that lead to the rare but critical defect events.
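The arithmetic behind the trivial classifier can be checked directly from the class counts in the table. The short sketch below uses only those counts; the variable names are illustrative.

```python
# Class counts taken from the table above.
counts = {
    "Non-Defective": 5848,
    "Blowhole Defect": 399,
    "Cold Shut Defect": 274,
    "Sand Inclusion Defect": 359,
    "Shrinkage Cavity Defect": 148,
}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a classifier that always predicts the majority class.
baseline_accuracy = counts[majority_label] / total

# Imbalance ratio: majority count divided by each class count.
imbalance_ratio = {
    label: counts[majority_label] / n for label, n in counts.items()
}

print(f"total samples: {total}")
print(f"always-majority accuracy: {baseline_accuracy:.3f}")
for label, ratio in imbalance_ratio.items():
    print(f"{label}: {ratio:.1f} : 1")
```

Any trained model should be judged against this always-majority accuracy rather than against 0%.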

To overcome this, data-level preprocessing strategies are employed. These strategies primarily fall into two categories: undersampling and oversampling. Undersampling involves randomly or strategically removing samples from the majority class to balance the dataset. Techniques like Edited Nearest Neighbors (ENN) remove majority-class instances whose class labels differ from most of their nearest neighbors. While it can reduce redundancy, undersampling often discards potentially useful information, which is particularly risky when the total volume of production data is already limited.
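To make the ENN rule concrete, here is a minimal sketch in plain Python. It assumes Euclidean distance, k = 3, and that only majority-class records are candidates for removal; the function name and the toy points are hypothetical and are not taken from any production dataset.

```python
import math

def edited_nearest_neighbours(X, y, majority_label, k=3):
    """Drop majority samples whose label differs from most of their k nearest neighbours."""
    keep = []
    for i, (x_i, y_i) in enumerate(zip(X, y)):
        if y_i != majority_label:
            keep.append(i)  # minority samples are always kept
            continue
        # Indices of the k samples closest to x_i (excluding itself).
        nearest = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(x_i, X[j]),
        )[:k]
        disagree = sum(1 for j in nearest if y[j] != y_i)
        # Keep the sample unless more than half of its neighbours disagree.
        if 2 * disagree <= len(nearest):
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy data: one "ok" record sits inside the cluster of defect records.
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
     (1.0, 1.0), (1.1, 1.0), (1.0, 1.1), (1.05, 1.05)]
y = ["ok", "ok", "ok", "ok", "defect", "defect", "defect", "ok"]

X_res, y_res = edited_nearest_neighbours(X, y, majority_label="ok")
print(len(X_res), y_res)
```

The sketch also shows the cost noted above: every removed record is real production data that the model will never see.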

Oversampling, conversely, aims to increase the number of minority-class samples. The naive approach of simply duplicating existing defect records leads to overfitting, as the model memorizes the exact instances without learning to generalize. This is where advanced synthetic data generation algorithms like SMOTE prove indispensable for sand casting defect prediction. The core philosophy of SMOTE is to create new, plausible examples of the minority class by interpolating between existing ones within the feature space. Because the algorithm works in the feature space of the process parameters rather than copying records, it increases the diversity of the defect dataset.

The SMOTE algorithm process can be formalized as follows. For each instance $ x_i $ in the minority class (e.g., a specific sand casting defect record):

Find its $ k $-nearest neighbors within the same minority class, based on Euclidean distance in the normalized feature space.
For a specified oversampling ratio $ N $, select $ N $ of these neighbors randomly. Let one such neighbor be $ x_{zi} $.
Generate a new synthetic sample $ x_{new} $ by interpolating along the line segment between $ x_i $ and $ x_{zi} $:
$$ x_{new} = x_i + \lambda \cdot (x_{zi} - x_i) $$
where $ \lambda $ is a random number uniformly distributed between 0 and 1, i.e., $ \lambda \sim U(0, 1) $.
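The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: it assumes the features are already normalized, it draws the neighbour with replacement for each synthetic point, and the function name, the `seed` parameter, and the toy points are hypothetical.

```python
import math
import random

def smote(minority, n_per_sample, k=5, seed=0):
    """Generate n_per_sample synthetic points for every minority sample."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for i, x_i in enumerate(minority):
        # Step 1: k nearest neighbours of x_i within the minority class.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(x_i, minority[j]),
        )[:k]
        for _ in range(n_per_sample):
            # Step 2: pick one neighbour x_zi at random.
            x_zi = minority[rng.choice(neighbours)]
            # Step 3: interpolate with lambda drawn uniformly from [0, 1).
            lam = rng.random()
            synthetic.append(
                tuple(a + lam * (b - a) for a, b in zip(x_i, x_zi))
            )
    return synthetic

# Toy minority class: the four corners of the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
synthetic = smote(minority, n_per_sample=3, k=2)
print(len(synthetic), synthetic[:2])
```

With k = 2, each corner's neighbours are its two adjacent corners, so every synthetic point lands on an edge of the square, that is, on a line segment between two existing minority samples.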

This mechanism is conceptually illustrated below. By generating points along the lines connecting existing minority class instances, SMOTE effectively populates the sparsely represented regions of the feature space, creating a more robust and continuous decision boundary for the classifier.

The practical implementation for sand casting data follows a clear pipeline. The data are first cleaned to remove irrelevant parameters, and the key features (chemistry, temperatures, sand properties) are normalized. SMOTE is then applied independently to each defect class until that class’s count is comparable to the majority class. A typical implementation in Python uses a library such as `imbalanced-learn` or a custom-built function. The result is a balanced dataset, composed largely of synthetic defect records, on which the model can learn without majority-class bias. The transformation is stark:

Class Label	Original Samples	After SMOTE Preprocessing
Non-Defective	5,848	5,848 (unchanged)
Blowhole Defect	399	~6,384
Cold Shut Defect	274	~6,006
Sand Inclusion Defect	359	~6,408
Shrinkage Cavity Defect	148	~6,215
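The two preparatory steps of this pipeline, normalizing the features and deciding how many synthetic records each defect class needs, can be sketched as follows. Min-max scaling and exact matching to the majority count are assumptions of this sketch (the counts in the table are approximate and somewhat higher), and the feature rows are hypothetical values for pouring temperature and sand moisture.

```python
def min_max_normalize(rows):
    """Scale every feature column to [0, 1]; constant columns map to 0."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple(
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(row, lows, highs)
        )
        for row in rows
    ]

def synthetic_needed(counts):
    """Synthetic records each minority class needs to match the majority."""
    target = max(counts.values())
    return {label: target - n for label, n in counts.items() if n < target}

# Hypothetical records: (pouring temperature in deg C, sand moisture in %).
rows = [(1380.0, 3.0), (1420.0, 4.0), (1400.0, 3.5)]
normalized = min_max_normalize(rows)

# Original class counts from the table.
counts = {
    "Non-Defective": 5848,
    "Blowhole Defect": 399,
    "Cold Shut Defect": 274,
    "Sand Inclusion Defect": 359,
    "Shrinkage Cavity Defect": 148,
}
needed = synthetic_needed(counts)

print(normalized)
print(needed)
```

Normalizing first matters because the nearest-neighbour search in SMOTE relies on Euclidean distance, which would otherwise be dominated by features with large numeric ranges such as temperature.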

To quantitatively evaluate the impact of this preprocessing on sand casting defect prediction, we must establish robust performance metrics. For a multi-class neural network classifier, accuracy alone is insufficient, especially given the imbalance in the original data. Therefore, we monitor both accuracy and cross-entropy loss during training. The cross-entropy loss for a multi-class problem is given by:
$$ L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \cdot \log(p_{i,c}) $$
where $ N $ is the number of samples (not to be confused with the oversampling ratio used in the SMOTE steps), $ C $ is the number of classes (defect types plus non-defective), $ y_{i,c} $ is a binary indicator equal to 1 if class $ c $ is the correct classification for sample $ i $ and 0 otherwise, and $ p_{i,c} $ is the predicted probability from the softmax output layer that sample $ i $ belongs to class $ c $. Accuracy is the straightforward ratio:
$$ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $$
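Both metrics are simple to compute by hand, which helps when checking a training log. The sketch below evaluates them on three hypothetical softmax outputs; the small `eps` clipping that guards against log(0) is an addition of this sketch and is not part of the formula above.

```python
import math

def cross_entropy(y_true, probs, eps=1e-12):
    """Mean cross-entropy; y_true holds class indices, probs holds softmax rows."""
    total = 0.0
    for label, p in zip(y_true, probs):
        # y_{i,c} is one-hot, so only the true-class term contributes.
        total -= math.log(max(p[label], eps))
    return total / len(y_true)

def accuracy(y_true, probs):
    """Share of samples whose highest-probability class is the true class."""
    correct = sum(
        1
        for label, p in zip(y_true, probs)
        if max(range(len(p)), key=p.__getitem__) == label
    )
    return correct / len(y_true)

# Hypothetical predictions for three samples and three classes.
y_true = [0, 1, 2]
probs = [
    [0.7, 0.2, 0.1],  # correct and confident
    [0.1, 0.8, 0.1],  # correct and confident
    [0.5, 0.3, 0.2],  # wrong: true class 2 gets only 0.2
]

loss = cross_entropy(y_true, probs)
acc = accuracy(y_true, probs)
print(f"loss={loss:.4f} accuracy={acc:.4f}")
```

The third sample contributes most of the loss, since the loss penalizes low probability on the true class, while accuracy only records that the prediction was wrong.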

When a model is trained on the original, highly imbalanced sand casting dataset, the learning dynamics are unstable and ultimately ineffective. The cross-entropy loss curve exhibits non-monotonic behavior, with sudden increases indicating periods where the model’s predictions diverge significantly from the true distribution. Concurrently, the accuracy curve shows volatile oscillations, particularly in the mid-stages of training. After extensive iterations, such a model may plateau at a deceptively high overall accuracy (e.g., ~86.5%), but this metric is dominated by its ability to correctly predict the “non-defective” class while failing miserably on all defect classes.
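One way to expose this failure mode is to report the share of correctly predicted samples within each class alongside the overall accuracy. Per-class recall is an addition of this sketch rather than a metric reported in the text, and the labels and predictions below are hypothetical.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Fraction of each true class that the model predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / n for label, n in totals.items()}

# Hypothetical outcome: the model predicts "ok" for every casting.
y_true = ["ok"] * 8 + ["blowhole", "cold_shut"]
y_pred = ["ok"] * 10

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = per_class_recall(y_true, y_pred)

print(f"overall accuracy: {overall:.2f}")
print(recall)
```

Here the overall accuracy looks respectable while the recall for both defect classes is zero, which is the pattern described above.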

The contrast after applying SMOTE preprocessing is dramatic. Training on the balanced dataset yields smooth, convergent learning curves. The cross-entropy loss decreases steadily and levels off at a low value, indicating that the model’s predictions are aligning with the true data distribution. The accuracy climbs steadily and smoothly to a high plateau. Critically, this final accuracy (e.g., ~97.9%) now reflects a model that is genuinely proficient at distinguishing between all classes, including the various types of sand casting defects. The training and test metrics on the balanced dataset at selected iterations underscore this stability:

Iteration	Train Accuracy (Balanced)	Train Loss (Balanced)	Test Accuracy (Balanced)	Test Loss (Balanced)
0	0.208	10.913	0.205	10.967
1000	0.964	0.154	0.962	0.159
2000	0.973	0.111	0.971	0.117
3000	0.976	0.094	0.974	0.099
5000	0.980	0.075	0.979	0.080

In conclusion, the challenge of imbalanced data is a critical bottleneck in developing reliable predictive models for industrial processes like sand casting. The scarcity of defect samples compared to non-defective ones leads to biased, inaccurate models that are blind to the very problems they are meant to foresee. The SMOTE data preprocessing algorithm provides an elegant and powerful solution by synthesizing new, plausible examples of minority defect classes directly within the feature space of process parameters. Applying SMOTE to an imbalanced dataset of complex castings transforms the learning landscape: it yields a balanced dataset from which a neural network model can learn robust, generalizable patterns for all defect types. The result is a marked improvement in predictive performance, with accuracy rising from a misleading 86.5% to a genuinely proficient 97.9%. This signifies more than a statistical improvement; it represents a practical step towards trustworthy, data-driven analytics in the foundry. By reliably forecasting the risk of sand casting defects such as cold shuts, blowholes, and shrinkage, manufacturers can move from reactive inspection to proactive process control, optimizing parameters in real time to prevent defects before they occur, thereby enhancing quality, reducing waste, and strengthening operational efficiency.