SMOTE for Predicting Sand Casting Defects: Mitigating Imbalance in Production Data

In my research and practical experience within the foundry industry, I have consistently encountered a critical challenge: predicting the occurrence of defects in complex sand castings using data-driven methods. While the vision of using historical production data to forecast and prevent defects like cold shuts, blowholes, sand inclusions, and shrinkage is powerful, its execution is often hampered by the fundamental nature of the data itself. The vast majority of castings produced are, thankfully, defect-free. This results in datasets where the examples of sand casting defects are vastly outnumbered by records of sound castings—a classic and severe case of class imbalance. This imbalance renders many standard machine learning models practically useless, as they tend to simply learn to predict the “no defect” majority class with high accuracy but fail completely at identifying the rare but critical defective cases. This document details my comprehensive approach to solving this problem by applying the Synthetic Minority Oversampling Technique (SMOTE) as a crucial data preprocessing step, significantly enhancing the performance of predictive models for sand casting defects.

The Data Challenge in Sand Casting

My work focuses on complex, safety-critical components, such as steering axle housings for machinery. These are typically produced via sand casting using ductile iron (e.g., QT450-10). The production process involves numerous parameters across different stages:

  • Alloy Chemistry: Concentrations of Carbon (C), Silicon (Si), Manganese (Mn), Magnesium (Mg), Sulfur (S), Phosphorus (P), and Aluminum (Al).
  • Molding/Sand Parameters: New sand percentage, bentonite percentage, compactability, moisture content, old sand temperature, and shear strength.
  • Pouring Parameters: Pouring temperature, pouring time, inoculant percentage, and total poured weight.

Each production run logs these parameters alongside the final quality inspection result. After initial cleaning to remove irrelevant records, the core dataset structure resembles the following:

| Sample ID | C (%) | Si (%) | Pouring Temp. (°C) | Compactability (%) | Moisture (%) | Defect Type |
|-----------|-------|--------|--------------------|--------------------|--------------|-------------|
| Cast-001 | 3.71 | 2.68 | 1407 | 36.88 | 1.78 | None |
| Cast-002 | 3.85 | 2.70 | 1388 | 40.06 | 2.19 | Cold Shut |
| Cast-003 | 3.74 | 2.68 | 1410 | 41.41 | 2.00 | Blowhole |
| Cast-004 | 3.68 | 2.82 | 1399 | 47.43 | 2.11 | Sand Inclusion |
| Cast-005 | 3.72 | 2.69 | 1406 | 40.90 | 2.05 | Shrinkage |

The stark reality of this data is its imbalance. In a typical dataset I work with, the distribution might be:

| Class | Number of Samples | Percentage |
|-------|-------------------|------------|
| No Defect | 5,848 | ~83.2% |
| Blowhole | 399 | ~5.7% |
| Cold Shut | 274 | ~3.9% |
| Sand Inclusion | 359 | ~5.1% |
| Shrinkage | 148 | ~2.1% |
| Total | 7,028 | 100% |

A model trained naively on this data could achieve roughly 83% accuracy by blindly predicting “No Defect” for every new casting. This is a catastrophic failure for quality control, as the primary goal is to identify the defective ~17%. This imbalance is the central obstacle to reliable prediction of sand casting defects.
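This accuracy paradox is easy to demonstrate. The short sketch below (a hypothetical illustration using the class counts from the distribution above) scores a trivial "predictor" that always outputs the majority class:

```python
import numpy as np

# Class counts from the distribution discussed above
counts = {"No Defect": 5848, "Blowhole": 399, "Cold Shut": 274,
          "Sand Inclusion": 359, "Shrinkage": 148}
y_true = np.concatenate([np.full(n, label) for label, n in counts.items()])

# A trivial "model" that always predicts the majority class
y_pred = np.full(y_true.shape, "No Defect")

accuracy = (y_true == y_pred).mean()
defect_mask = y_true != "No Defect"
# Fraction of truly defective castings that were flagged as defective
defect_recall = (y_pred[defect_mask] != "No Defect").mean()

print(f"accuracy      = {accuracy:.3f}")       # high despite learning nothing
print(f"defect recall = {defect_recall:.3f}")  # zero: no defect ever caught
```

The headline accuracy looks respectable, yet the recall on defective castings is exactly zero, which is why accuracy alone is a misleading metric here.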

Addressing Imbalance: The Rationale for SMOTE

Two primary families of techniques exist to handle class imbalance: undersampling and oversampling. Undersampling, such as the Edited Nearest Neighbors (ENN) algorithm, removes samples from the majority class. While ENN attempts to do this intelligently by removing points whose neighbors are predominantly from the majority class, it inherently discards valuable data. In industrial contexts where data collection is expensive, discarding thousands of valid production records is often unacceptable.

Oversampling, conversely, aims to increase the number of minority class samples. Simple replication (duplication) of existing defect records leads to severe overfitting, as the model learns from identical examples. The Synthetic Minority Oversampling Technique (SMOTE) provides an elegant solution by generating new, synthetic examples that are plausible variations of the real ones. The core idea is to interpolate between existing minority class samples in feature space. For a given minority sample \( x_i \), the algorithm finds its \( k \)-nearest neighbors from the same class. A synthetic sample \( x_{new} \) is then created along the line segment joining \( x_i \) and one of its randomly chosen neighbors \( x_{zi} \):

$$ x_{new} = x_i + \lambda \cdot (x_{zi} - x_i) $$

where \( \lambda \) is a random number drawn uniformly from [0, 1]. This effectively creates new data points in the region of feature space where defects are known to occur, enriching the dataset without mere duplication. The method is particularly well suited to the multi-dimensional, continuous parameter space of sand casting defect prediction.
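For intuition, the interpolation step can be sketched directly in NumPy. This is a simplified illustration only, not a full SMOTE implementation; production code should use a maintained library such as imbalanced-learn, which adds per-class bookkeeping and sampling-ratio control:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_samples(X_minority, n_synthetic, k=5, seed=42):
    """Generate synthetic minority samples by interpolating between a
    minority point x_i and one of its k nearest same-class neighbours."""
    rng = np.random.default_rng(seed)
    # k+1 because each point is returned as its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, idx = nn.kneighbors(X_minority)  # idx[:, 0] is the point itself

    base = rng.integers(0, len(X_minority), n_synthetic)    # picks x_i
    neigh = idx[base, rng.integers(1, k + 1, n_synthetic)]  # random neighbour x_zi
    lam = rng.random((n_synthetic, 1))                      # lambda in [0, 1)
    # x_new = x_i + lambda * (x_zi - x_i)
    return X_minority[base] + lam * (X_minority[neigh] - X_minority[base])
```

Note that this sketch requires at least k+1 samples in the minority class; every synthetic point lies on a segment between two real minority samples, so no value can fall outside the observed per-feature range.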

Methodology: Implementing SMOTE for Casting Data

My implementation pipeline is designed to be robust and reproducible. The core steps are as follows:

  1. Data Preparation: Load and normalize the casting production data. Normalization (e.g., scaling features to a [0,1] range) is crucial for distance-based algorithms like SMOTE and subsequent neural network training. The target variable is the defect type (multi-class: No Defect, Cold Shut, Blowhole, Sand Inclusion, Shrinkage).
  2. SMOTE Application: Apply the SMOTE algorithm separately to each minority defect class. The key parameters are:
    • k_neighbors: The number of nearest neighbors to consider for interpolation. A value of 5 is a common and effective starting point.
    • sampling_strategy: The desired ratio of minority to majority class samples. I typically aim to balance the dataset, setting the target for each defect class equal to the majority class count or a significant fraction thereof.
  3. Synthetic Dataset Creation: The original majority class samples are combined with the original and newly synthesized minority class samples to form a balanced training set.

Here is a summary of the key parameters and their settings in my study:

| Parameter | Value/Description | Rationale |
|-----------|-------------------|-----------|
| k_neighbors | 5 | Standard choice; balances locality and diversity in synthetic sample generation. |
| sampling_strategy | Balanced (raise all defect classes to approximately the majority count) | Creates a near-balanced dataset for initial model training and evaluation. |
| random_state | Fixed seed (e.g., 42) | Ensures reproducibility of the synthetic data generation process. |

The visual impact of SMOTE on a 2D projection (e.g., using Pouring Temperature and Bentonite %) of the data is clear. Initially, minority class points are sparse. After SMOTE, the feature space occupied by the defect classes is densely and representatively populated, bridging gaps between isolated real samples.

The quantitative result is a transformed dataset. Using the example numbers from earlier, the post-SMOTE dataset is substantially larger and approximately balanced:

| Class | Original Samples | Synthetic Samples Generated | Total in Balanced Set |
|-------|------------------|-----------------------------|-----------------------|
| No Defect | 5,848 | 0 | 5,848 |
| Blowhole | 399 | 5,985 | 6,384 |
| Cold Shut | 274 | 5,732 | 6,006 |
| Sand Inclusion | 359 | 6,049 | 6,408 |
| Shrinkage | 148 | 6,067 | 6,215 |
| Total | 7,028 | 23,833 | 30,861 |

Predictive Modeling and Performance Evaluation

To evaluate the impact of SMOTE preprocessing, I constructed a neural network classifier. The architecture and evaluation metrics were chosen as follows:

  • Architecture: A fully connected network with multiple hidden layers using ReLU (Rectified Linear Unit) activation functions. ReLU is defined as \( f(x) = \max(0, x) \), helping to mitigate the vanishing gradient problem and enabling efficient training.
  • Output Layer: A softmax activation function for multi-class classification. For a given output node \( z_i \), softmax calculates the probability as:
    $$ P(\text{class}_i) = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}} $$
    where \( C \) is the number of classes (5 in our case).
  • Loss Function: Categorical Cross-Entropy, which is the standard for multi-class classification. For a single sample with true class label \( y \) (one-hot encoded) and predicted probability distribution \( \hat{p} \), the loss is:
    $$ L = -\sum_{c=1}^{C} y_c \cdot \log(\hat{p}_c) $$
    This loss heavily penalizes confident but incorrect predictions.
  • Primary Metric: Classification accuracy on the balanced set, paired with detailed per-class precision and recall analysis to confirm that every sand casting defect class was being identified correctly.
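The softmax and cross-entropy formulas above can be checked numerically with a short sketch (toy logits for a single casting, not the trained network):

```python
import numpy as np

def softmax(z):
    # Subtract the row max for numerical stability before exponentiating
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(y_onehot, p_hat, eps=1e-12):
    # L = -sum_c y_c * log(p_hat_c), averaged over the batch
    return float(-np.mean(np.sum(y_onehot * np.log(p_hat + eps), axis=-1)))

logits = np.array([[2.0, 0.1, -1.0, 0.3, -0.5]])  # one casting, 5 classes
p = softmax(logits)

y_correct = np.array([[1, 0, 0, 0, 0]])  # true class has the largest logit
y_wrong   = np.array([[0, 0, 1, 0, 0]])  # true class has a small logit

assert np.isclose(p.sum(), 1.0)  # softmax yields a probability distribution
# Confident-and-right is cheap; confident-and-wrong is heavily penalised
assert cross_entropy(y_correct, p) < cross_entropy(y_wrong, p)
```

The second assertion illustrates the point made above: the loss is small when the predicted distribution concentrates on the true class and grows sharply when it does not.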

The contrast in model behavior when trained on the raw imbalanced data versus the SMOTE-balanced data was profound.

Training on Imbalanced Data: The learning process was unstable and ineffective. The cross-entropy loss did not converge smoothly, often showing periods of increase, indicating the model’s struggle to find a consistent gradient direction. Accuracy plateaued close to the majority class share (about 86.5% in these runs) and fluctuated wildly during training. A per-class breakdown reveals near-zero recall for all defect classes.

Training on SMOTE-Balanced Data: The improvement was dramatic. The training process became stable and convergent.

| Iteration Block | Training Accuracy | Training Loss | Validation Accuracy | Validation Loss |
|-----------------|-------------------|---------------|---------------------|-----------------|
| 0–500 | 0.208 → 0.947 | 10.91 → 0.222 | 0.205 → 0.944 | 10.97 → 0.225 |
| 1000–1500 | 0.964 → 0.970 | 0.154 → 0.127 | 0.962 → 0.968 | 0.159 → 0.132 |
| 2000–2500 | 0.973 → 0.975 | 0.111 → 0.101 | 0.971 → 0.973 | 0.117 → 0.107 |
| 4000–5000 | 0.978 → 0.980 | 0.079 → 0.075 | 0.978 → 0.979 | 0.083 → 0.080 |

The cross-entropy loss decreased monotonically and smoothly towards zero. The accuracy increased consistently, ultimately reaching stable values around 97.9% for both training and validation sets. More importantly, this high accuracy was now achieved by correctly classifying all classes, not just the majority. The model learned the genuine, complex relationships between process parameters and the occurrence of various sand casting defects.
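Per-class behaviour is easiest to audit with scikit-learn's metric utilities. The labels and predictions below are placeholders; in practice this would be run on the held-out validation set:

```python
from sklearn.metrics import classification_report, recall_score

# Placeholder labels/predictions standing in for a real validation split
y_true = ["No Defect", "No Defect", "No Defect", "Blowhole", "Blowhole",
          "Cold Shut", "Shrinkage", "Sand Inclusion"]
y_pred = ["No Defect", "No Defect", "No Defect", "Blowhole", "No Defect",
          "Cold Shut", "Shrinkage", "Sand Inclusion"]

# Overall accuracy hides which classes fail; per-class recall exposes it
print(classification_report(y_true, y_pred, zero_division=0))

blowhole_recall = recall_score(y_true, y_pred, labels=["Blowhole"],
                               average=None, zero_division=0)[0]
print(f"Blowhole recall: {blowhole_recall:.2f}")  # 0.50: one of two caught
```

A report like this, generated after every training run, makes it immediately visible whether the high aggregate accuracy is actually distributed across all defect classes.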

Discussion and Practical Implications

The application of SMOTE in this context is more than a technical preprocessing step; it is an enabler for practical, data-driven quality control in sand foundries. By transforming an imbalanced dataset into a balanced one, it allows modern machine learning techniques to fulfill their potential for predicting sand casting defects. The leap from ~86.5% accuracy with near-zero defect recall (effectively a non-functional model) to ~98% accuracy with strong per-class performance (a highly reliable model) is transformative.

However, several important considerations must be noted:

  1. Feature Space Validity: SMOTE generates synthetic samples based on interpolation. It assumes the feature space between minority samples is valid and representative of potential defect conditions. This is generally reasonable for continuous process parameters like temperature and chemical composition but requires domain knowledge to confirm.
  2. Noise Amplification: If the original minority data contains outliers or mislabeled samples, SMOTE can amplify this noise by generating synthetic samples around them. Careful initial data cleaning is essential.
  3. Integration with Model Training: The balanced dataset is used for training. When deploying the final model in production, it will make predictions on new, real-world imbalanced data. The model’s learned decision boundaries from the balanced training are robust and generalize well to this real distribution, as evidenced by the high validation accuracy on a held-out set that maintains the original imbalance.
  4. Beyond SMOTE: Advanced variants of SMOTE (e.g., Borderline-SMOTE, SVM-SMOTE) or hybrid methods could be explored to further refine synthetic sample generation, especially if certain defect classes are particularly difficult to separate.

The practical workflow I advocate is: 1) Collect and clean historical production data, 2) Apply SMOTE or a similar advanced oversampling technique to balance defect classes, 3) Train a robust classifier (neural network, gradient boosting, etc.) on the balanced data, and 4) Deploy the model to provide real-time or batch predictions on new production runs, flagging high-risk castings for enhanced inspection.

Conclusion

The severe class imbalance inherent in sand casting production data is the principal barrier to implementing effective defect prediction systems. My work demonstrates that the Synthetic Minority Oversampling Technique (SMOTE) is a highly effective and essential preprocessing tool to overcome this barrier. By scientifically generating synthetic examples of sand casting defects—such as cold shuts, blowholes, sand inclusions, and shrinkage—within the multi-dimensional feature space of process parameters, SMOTE enables the creation of balanced, informative datasets. Training predictive models on these balanced datasets leads to stable convergence and high classification accuracy, transforming an academic concept into a practical tool for improving yield, reducing scrap, and enhancing quality control in sand foundries. This approach provides a foundational methodology for leveraging the vast amounts of data generated in modern manufacturing to proactively address the age-old challenge of sand casting defects.
