Application of SMOTE Data Preprocessing in Predicting Metal Casting Defects

In modern industrial processes, metal casting defects pose significant challenges to product quality and efficiency. As a researcher focused on data-driven solutions, I have encountered the pervasive issue of imbalanced datasets in predicting these defects, particularly in sand casting environments. The scarcity of data for specific metal casting defects such as cold shuts, blowholes, sand inclusions, and shrinkage cavities compared to non-defective samples severely hampers the accuracy of predictive models. This imbalance often leads to models that are biased toward the majority class, rendering them ineffective for practical applications. To address this, I explored the application of the Synthetic Minority Over-sampling Technique (SMOTE) as a data preprocessing method to enhance the prediction of metal casting defects. Through this study, I aimed to scientifically augment imbalanced datasets and evaluate the impact on model performance, ultimately striving to improve the reliability of defect prediction systems in metal casting processes.

The data utilized in this investigation were sourced from the production records of steering bridge castings, which are critical components in engineering machinery made from ductile iron grade QT450-10. These castings undergo rigorous conditions, requiring high mechanical strength and fatigue resistance. The initial dataset comprised numerous parameters, including chemical composition, pouring parameters, and sand treatment metrics. However, after preliminary preprocessing to remove irrelevant entries, the data revealed a stark imbalance: out of approximately 7,000 records, only about 1,000 pertained to defective castings, with the majority being non-defective. This distribution is typical in real-world metal casting scenarios, where defect occurrences are rare but critical to identify. The table below summarizes the initial dataset characteristics after basic preprocessing, highlighting the severe imbalance in metal casting defects data.

| Defect Category | Number of Samples |
|---|---|
| Non-Defective | 5,848 |
| Cold Shut | 274 |
| Blowhole | 399 |
| Sand Inclusion | 359 |
| Shrinkage Cavity | 148 |

Data imbalance in metal casting defects prediction arises because most machine learning algorithms assume balanced class distributions. When one class, such as non-defective castings, dominates the dataset, models tend to prioritize accuracy on that class while neglecting minority classes like specific metal casting defects. For instance, if a model simply predicts all samples as non-defective, it can achieve high accuracy but fails to detect actual defects, which is unacceptable in quality control. To mitigate this, I considered both undersampling and oversampling techniques. Undersampling methods, such as the Edited Nearest Neighbor (ENN) algorithm, remove samples from the majority class but risk losing valuable information. In contrast, oversampling techniques like SMOTE generate synthetic samples for minority classes, preserving data integrity while addressing imbalance. Given the importance of every production record in metal casting, I opted for SMOTE to handle the imbalance in metal casting defects data.
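To make this risk concrete with the class counts above, a trivial classifier that labels every casting as non-defective would already score $$ \text{Accuracy}_{\text{trivial}} = \frac{5{,}848}{5{,}848 + 274 + 399 + 359 + 148} = \frac{5{,}848}{7{,}028} \approx 83.2\% $$ while detecting none of the four defect types, which is why accuracy alone is a misleading measure on imbalanced metal casting data.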

The core idea of SMOTE involves creating synthetic examples for minority classes by interpolating between existing samples. For each minority sample \( x_i \), the algorithm identifies its k-nearest neighbors within the same class. Then, for a selected neighbor \( x_{zi} \), a new synthetic sample \( x_n \) is generated using the formula: $$ x_n = x_i + \text{rand}(0,1) \times (x_{zi} - x_i) $$ where \( \text{rand}(0,1) \) is a random number between 0 and 1. This approach effectively expands the feature space for minority classes, such as the various metal casting defects, without merely duplicating existing records, thus reducing overfitting. The process begins by loading the preprocessed data, normalizing it, and applying SMOTE to generate a balanced dataset. The following table illustrates the data volume before and after SMOTE preprocessing, demonstrating how it addresses the imbalance in metal casting defects.

| Category | Before Preprocessing | After SMOTE Preprocessing |
|---|---|---|
| Non-Defective | 5,848 | 5,848 |
| Blowhole Defect | 399 | 6,384 |
| Cold Shut Defect | 274 | 6,006 |
| Sand Inclusion Defect | 359 | 6,408 |
| Shrinkage Cavity Defect | 148 | 6,215 |
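As an illustration of how this interpolation can be carried out, the sketch below generates synthetic samples for a single minority class using NumPy and scikit-learn's NearestNeighbors. The function name, parameter choices, and neighbor search shown here are assumptions for illustration, not the study's actual implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_oversample(X_min, n_synthetic, k=5, random_state=0):
    """Generate n_synthetic samples for one minority class X_min of shape (n_samples, n_features)."""
    rng = np.random.default_rng(random_state)
    # k-nearest neighbors within the same minority class (index 0 is the sample itself, so request k + 1)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neighbor_idx = nn.kneighbors(X_min)

    synthetic = np.empty((n_synthetic, X_min.shape[1]))
    for s in range(n_synthetic):
        i = rng.integers(len(X_min))          # pick a minority sample x_i
        zi = rng.choice(neighbor_idx[i][1:])  # pick one of its k neighbors x_zi
        gap = rng.random()                    # rand(0, 1)
        # x_n = x_i + rand(0,1) * (x_zi - x_i)
        synthetic[s] = X_min[i] + gap * (X_min[zi] - X_min[i])
    return synthetic
```

For instance, calling `smote_oversample` on the 399 blowhole records with `n_synthetic = 6384 - 399` would raise that class to the count shown in the table above; the other defect classes can be expanded in the same way.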

To implement SMOTE for metal casting defects prediction, I used Python libraries such as NumPy for numerical computations, xlrd and xlwt for handling Excel files, and scikit-learn for machine learning utilities. The steps included data loading, normalization, and the creation of a custom SMOTE class to generate synthetic samples. This process ensured that the expanded dataset maintained the underlying patterns of metal casting defects while achieving balance. Visualizing the data before and after preprocessing, for example, using features like pouring temperature and sand mixture ratios, showed a more uniform distribution across classes. The synthetic samples filled gaps in the feature space, enhancing the representativeness of metal casting defects in the training set.
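A condensed sketch of that flow is given below, assuming the production records have been exported to a CSV file with the defect label in the last column and the four defect classes encoded as 1 through 4; the file name, column layout, label encoding, and target counts are illustrative assumptions, and the `smote_oversample` function from the previous sketch stands in for the custom SMOTE class.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Assumed CSV export of the Excel production records; last column assumed to be the defect label.
data = np.loadtxt("casting_records.csv", delimiter=",", skiprows=1)
X, y = data[:, :-1], data[:, -1].astype(int)

# Min-max normalization of all features to [0, 1].
X = MinMaxScaler().fit_transform(X)

# Target totals per minority class, taken from the table above (labels 1-4 are an assumed encoding).
targets = {1: 6006, 2: 6384, 3: 6408, 4: 6215}  # cold shut, blowhole, sand inclusion, shrinkage cavity
X_parts, y_parts = [X], [y]
for label, target in targets.items():
    X_min = X[y == label]
    X_syn = smote_oversample(X_min, target - len(X_min))  # from the earlier sketch
    X_parts.append(X_syn)
    y_parts.append(np.full(len(X_syn), label))

X_bal = np.vstack(X_parts)
y_bal = np.concatenate(y_parts)
```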

Evaluating the impact of data preprocessing on metal casting defects prediction required a robust model architecture. I designed a neural network with multiple hidden layers using ReLU activation functions, defined as \( f(x) = \max(0, x) \), to introduce non-linearity. For the output layer, Softmax activation was employed to handle multi-class classification, assigning probabilities to each defect category. The Softmax function for a node \( i \) is given by: $$ \text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}} $$ where \( C \) is the number of classes, here the non-defective class plus the four defect categories. Performance metrics included accuracy and cross-entropy loss. Accuracy measures the proportion of correct predictions: $$ \text{Accuracy} = \frac{\text{Correct Predictions}}{\text{Total Samples}} \times 100\% $$ Cross-entropy loss, used for optimization, is defined for the multi-class case as: $$ L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log(p_{i,c}) $$ where \( N \) is the number of samples, \( y_{i,c} \) is 1 if sample \( i \) belongs to class \( c \) and 0 otherwise, and \( p_{i,c} \) is the Softmax probability the model assigns to that class. These metrics were crucial for assessing how well the model learned to identify metal casting defects.
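A minimal version of this evaluation setup can be sketched with scikit-learn's MLPClassifier, which uses ReLU hidden units and a Softmax output for multi-class problems; the hidden-layer sizes, train/test split, and iteration count below are assumptions for illustration, not the exact architecture used in the study.

```python
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, log_loss

# X_bal, y_bal are the SMOTE-balanced arrays from the previous sketch; the same
# steps apply to the raw imbalanced data for the before/after comparison.
X_train, X_test, y_train, y_test = train_test_split(
    X_bal, y_bal, test_size=0.2, stratify=y_bal, random_state=0)

# Two ReLU hidden layers (sizes assumed); the multi-class output layer is Softmax.
model = MLPClassifier(hidden_layer_sizes=(64, 32), activation="relu",
                      max_iter=5000, random_state=0)
model.fit(X_train, y_train)

# Accuracy = correct predictions / total samples; log_loss is the cross-entropy.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("test cross-entropy:", log_loss(y_test, model.predict_proba(X_test)))
```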

Before applying SMOTE, the defect prediction model exhibited suboptimal performance due to data imbalance. Over 5,000 training iterations, the cross-entropy loss fluctuated and occasionally increased, indicating unstable learning. Similarly, the accuracy curves oscillated, particularly between iterations 800 and 1,000, suggesting that the model struggled to generalize across the imbalanced metal casting defect classes. The test accuracy plateaued at approximately 86.50%, which is inadequate for practical applications where detecting metal casting defects is critical. The table below provides the accuracy and cross-entropy loss values during training and testing before preprocessing, highlighting the inconsistencies caused by the imbalance in the metal casting defects data.

| Iteration | Training Accuracy | Training Loss | Testing Accuracy | Testing Loss |
|---|---|---|---|---|
| 0 | 0.3595 | 1.8711 | 0.3724 | 1.8786 |
| 500 | 0.8371 | 0.3899 | 0.8457 | 0.3741 |
| 1000 | 0.8544 | 0.3587 | 0.8650 | 0.3534 |
| 1500 | 0.8565 | 0.3532 | 0.8662 | 0.3490 |
| 2000 | 0.8565 | 0.3487 | 0.8667 | 0.3454 |
| 2500 | 0.8574 | 0.3443 | 0.8690 | 0.3431 |
| 3000 | 0.8601 | 0.3439 | 0.8633 | 0.3461 |
| 3500 | 0.8626 | 0.3384 | 0.8679 | 0.3410 |
| 4000 | 0.8628 | 0.3353 | 0.8656 | 0.3391 |
| 4500 | 0.8643 | 0.3304 | 0.8650 | 0.3364 |
| 5000 | 0.8639 | 0.3280 | 0.8650 | 0.3383 |

After implementing SMOTE preprocessing, the model’s performance improved dramatically. The cross-entropy loss decreased monotonically and smoothly over iterations, approaching zero, which indicates stable learning and better alignment with the true distribution of metal casting defects. Similarly, accuracy increased consistently, reaching approximately 97.99% on the training set and 97.91% on the test set. This enhancement underscores the effectiveness of SMOTE in handling imbalanced metal casting defects data. The following table details the performance metrics after preprocessing, showing the progressive improvement in predicting metal casting defects.

| Iteration | Training Accuracy | Training Loss | Testing Accuracy | Testing Loss |
|---|---|---|---|---|
| 0 | 0.2081 | 10.9126 | 0.2046 | 10.9672 |
| 500 | 0.9474 | 0.2219 | 0.9443 | 0.2255 |
| 1000 | 0.9644 | 0.1541 | 0.9620 | 0.1591 |
| 1500 | 0.9703 | 0.1266 | 0.9680 | 0.1323 |
| 2000 | 0.9726 | 0.1114 | 0.9713 | 0.1170 |
| 2500 | 0.9749 | 0.1014 | 0.9732 | 0.1066 |
| 3000 | 0.9757 | 0.0939 | 0.9744 | 0.0988 |
| 3500 | 0.9763 | 0.0879 | 0.9751 | 0.0927 |
| 4000 | 0.9772 | 0.0829 | 0.9762 | 0.0876 |
| 4500 | 0.9785 | 0.0788 | 0.9781 | 0.0834 |
| 5000 | 0.9799 | 0.0753 | 0.9791 | 0.0798 |

The results demonstrate that SMOTE data preprocessing significantly enhances the prediction of metal casting defects by addressing dataset imbalance. By generating synthetic samples for minority classes, the algorithm enables models to learn more representative patterns, leading to higher accuracy and stability. In this study, the accuracy improvement from 86.50% to 97.91% highlights the potential of SMOTE in industrial applications for quality control. Future work could explore hybrid approaches combining SMOTE with other techniques to further optimize metal casting defects prediction. Overall, this approach provides a scalable solution for managing imbalanced data in metal casting processes, ultimately contributing to reduced defect rates and improved production efficiency.
