Defect Prediction in Complex Casting Parts through Data Synthesis and Machine Learning

In the manufacture of core automotive components and in other critical industrial sectors, defects in complex castings have particularly severe consequences. These flaws can compromise structural integrity, lead to catastrophic system failures, and cause significant financial and reputational losses. Predicting defects in complex casting parts before they occur, and improving production quality accordingly, is therefore an important and urgent task. Traditional quality control methods, such as X-ray or ultrasonic inspection, are applied after casting and serve as detection rather than prediction tools. This reactive approach does not prevent resources from being wasted on defective parts. Consequently, intelligent systems that forecast the quality of a casting part from its process parameters represent a shift towards proactive, data-driven manufacturing.

This article addresses a critical and common challenge in applying machine learning to industrial data: severe class imbalance. In a typical foundry, the majority of casting part productions are successful, leading to a vast dataset of “non-defective” samples. In contrast, data for specific defect types like gas pores or sand inclusions are comparatively scarce. Training a predictive model on such imbalanced data yields a biased learner that may achieve high accuracy by simply always predicting “non-defective,” rendering it useless for the actual task of identifying potential flaws. Our research focuses on overcoming this hurdle for a specific, intricate casting part—an engine cylinder block. We present a comprehensive methodology that integrates advanced data synthesis techniques with robust machine learning models to build a highly accurate defect prediction system.

The core of our work lies in constructing a predictive model that can classify a casting part as either non-defective or as belonging to a specific defect category (e.g., gas pore, sand hole) based on a multitude of input parameters. These parameters encompass the entire production chain, including chemical composition of the melt, sand mold properties, core characteristics, and pouring conditions. By analyzing historical data where both the input parameters (process settings) and the output (final quality) are known, a machine learning model learns the complex, non-linear relationships that dictate the quality of the final casting part. This learned model can then be deployed to predict the outcome of new production runs, allowing for pre-emptive corrective actions.

The Imperative for Defect Prediction in Critical Casting Parts

The casting part under study is a cylinder block for a large-displacement diesel engine. This component is a quintessential example of a complex casting: it has thin walls, intricate internal passages for coolant and oil, and must withstand extreme mechanical and thermal stresses. A defect in such a safety-critical casting part is not merely a quality lapse; it is a potential source of engine failure. The traditional “produce-then-inspect” paradigm is costly and inefficient. Thus, the industrial demand is clear: a predictive system that can, with high confidence, forecast whether a given set of process parameters will yield a sound casting part or one prone to specific defects. This capability enables targeted process optimization, reduces scrap rates, improves resource efficiency, and strengthens supply chain reliability.

Confronting the Data Imbalance Challenge

The initial dataset collected from the production floor starkly illustrated the class imbalance problem. The distribution was as follows:

Quality Class	Number of Samples (Raw Data)
Non-Defective Casting Part	1,504
Casting Part with Gas Pore Defect	100
Casting Part with Sand Hole Defect	272

Training a classifier on this dataset would inevitably lead to a model biased toward the majority class (non-defective). To make meaningful predictions about the defective states of a casting part, we must artificially balance the dataset. Simple oversampling by duplicating minority class samples risks causing model overfitting. Instead, we employ the Synthetic Minority Over-sampling Technique (SMOTE), a sophisticated algorithm that generates synthetic examples for the minority classes.
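
To make the bias concrete, the following minimal sketch takes the raw class counts from the table above and scores a hypothetical classifier that always predicts the majority class. The function name and the use of macro-averaged recall as a second metric are illustrative choices, not part of the original study.

```python
# Raw class counts taken from the table above.
counts = {"non_defective": 1504, "gas_pore": 100, "sand_hole": 272}


def majority_baseline(counts):
    """Score a trivial classifier that always predicts the largest class."""
    total = sum(counts.values())
    majority = max(counts, key=counts.get)
    accuracy = counts[majority] / total
    # Recall is 1.0 for the class that is always predicted and 0.0 otherwise.
    recall = {name: 1.0 if name == majority else 0.0 for name in counts}
    macro_recall = sum(recall.values()) / len(recall)
    return majority, accuracy, recall, macro_recall


majority, accuracy, recall, macro_recall = majority_baseline(counts)
print(f"always predict {majority}: accuracy = {accuracy:.1%}")
print(f"per-class recall = {recall}")
print(f"macro-averaged recall = {macro_recall:.2f}")
```

Such a classifier reaches about 80% accuracy without detecting a single defect, which is why accuracy alone is a weak indicator on the raw data.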

The SMOTE Algorithm: Synthesizing Data for Better Prediction

The SMOTE algorithm operates in the feature space of the minority class. For each existing sample $x_i$ in a minority class (e.g., “gas pore” class), it performs the following steps:

1. Finds the $k$ nearest neighbors of $x_i$ within the same minority class. Typically, $k=5$.
2. Randomly selects one of these $k$ neighbors, denoted as $x_{zi}$.
3. Creates a new, synthetic sample $x_{new}$ by interpolating between $x_i$ and $x_{zi}$. The interpolation is controlled by a random number $\lambda$ between 0 and 1.

The mathematical formulation for generating each synthetic sample is:
$$ x_{new} = x_i + \lambda \cdot (x_{zi} - x_i) $$
where $\lambda \sim \text{rand}(0, 1)$.
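
The interpolation can be sketched in a few lines of standard-library Python. This is a simplified illustration under stated assumptions, not the implementation used in the study: the function name `smote`, the fixed random seed, the cycling over existing samples to reach a target count, and the toy feature values are all assumptions made for the example. In practice a maintained library implementation would normally be preferred.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points from a list of minority-class vectors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot use more neighbours than exist
    synthetic = []
    for n in range(n_new):
        i = n % len(minority)  # cycle through the existing samples
        x_i = minority[i]
        # k nearest neighbours of x_i within the same class (Euclidean distance).
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(x_i, minority[j]))
        x_zi = minority[rng.choice(others[:k])]
        lam = rng.random()  # lambda drawn uniformly from [0, 1)
        # x_new = x_i + lambda * (x_zi - x_i), applied feature by feature.
        synthetic.append([a + lam * (b - a) for a, b in zip(x_i, x_zi)])
    return synthetic


# Toy minority class with two normalized features; the values are made up.
gas_pore = [[0.10, 0.80], [0.15, 0.75], [0.20, 0.90], [0.12, 0.85]]
new_points = smote(gas_pore, n_new=6, k=3)
for point in new_points:
    print([round(v, 3) for v in point])
```

Every synthetic point lies on a line segment between two real minority samples, so it always stays inside the range spanned by the observed data.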

This process effectively “fills in” the feature space between existing minority class samples, creating a denser and more continuous representation of the conditions that lead to a defective casting part. We applied SMOTE separately to the ‘gas pore’ and ‘sand hole’ classes, raising their sample counts from 100 and 272 to roughly the size of the majority class. The post-synthesis dataset was approximately balanced, providing a sounder basis for training an unbiased model.

Quality Class	Number of Samples (After SMOTE)
Non-Defective Casting Part	1,504
Casting Part with Gas Pore Defect	~1,500
Casting Part with Sand Hole Defect	~1,600

Building the Machine Learning Prediction Model

With a balanced dataset, we proceeded to construct a supervised learning model. The input features (22 parameters) were meticulously selected to capture the key influential factors on the quality of the final casting part. These features were normalized to ensure stable and efficient training. The target variable was the defect class, encoded in a one-hot format for multi-class classification:

[1, 0, 0] : Non-Defective Casting Part
[0, 1, 0] : Casting Part with Gas Pore Defect
[0, 0, 1] : Casting Part with Sand Hole Defect
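
As a small illustration of this encoding, the sketch below maps class labels to one-hot vectors and maps a predicted probability vector back to a label. The label strings and helper names are assumptions made for the example; in a Keras workflow a utility such as `to_categorical` would typically do the encoding.

```python
# Class order matches the encoding listed above.
CLASSES = ["non_defective", "gas_pore", "sand_hole"]


def one_hot(label, classes=CLASSES):
    """Return a vector with a 1 at the position of `label` and 0 elsewhere."""
    vector = [0] * len(classes)
    vector[classes.index(label)] = 1
    return vector


def decode(vector, classes=CLASSES):
    """Return the class whose entry in `vector` is largest."""
    best = max(range(len(vector)), key=lambda j: vector[j])
    return classes[best]


print(one_hot("gas_pore"))         # [0, 1, 0]
print(decode([0.05, 0.15, 0.80]))  # sand_hole
```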

We implemented a fully-connected Artificial Neural Network (ANN), a proven and powerful architecture for capturing complex, non-linear relationships in high-dimensional data. The model was developed using Python with the TensorFlow/Keras library. The core steps in the model pipeline are summarized below:

Step	Action	Purpose
1. Data Loading & Preprocessing	Load raw data, handle missing values, perform one-hot encoding on the target.	Prepare data in a format digestible by the neural network.
2. Feature Normalization	Scale all input features to a standard range (e.g., [0,1] or [-1,1]) using Min-Max scaling.	Avoids domination by features with large magnitudes and accelerates convergence. Formula for the [0,1] case: $$ x' = \frac{x - \min(x)}{\max(x) - \min(x)} $$
3. Data Synthesis	Apply the SMOTE algorithm to the normalized training set.	Balances the class distribution to prevent model bias.
4. Train-Test Split	Randomly split the synthetic dataset into training (≈66%) and testing (≈34%) sets.	Ensures the model is evaluated on unseen data to gauge its generalization ability.
5. Model Architecture	Define a sequential neural network with input layer, multiple hidden layers (with ReLU activation), and an output layer with softmax activation.	The softmax function outputs a probability distribution over the three classes for each casting part. $$ \text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{3} e^{z_j}} $$
6. Model Compilation	Specify the optimizer (Adam), loss function (Categorical Cross-Entropy), and evaluation metric (Accuracy).	Configures the learning process. Cross-entropy loss is ideal for multi-class classification.
7. Model Training	Iteratively present batches of training data to the network, calculate loss, and update weights via backpropagation.	The model learns the mapping from process parameters to casting part quality.
8. Model Evaluation	Use the held-out test set to calculate final prediction accuracy and other metrics.	Provides an unbiased estimate of the model’s performance in a real-world setting.
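
The formulas in the table can be checked numerically with a short, dependency-free sketch. It is not the TensorFlow/Keras code used in the study, where these operations are provided by the library; the function names, the pouring-temperature values, and the logits below are hypothetical and serve only to show Min-Max scaling, softmax, and categorical cross-entropy on small inputs.

```python
import math


def min_max_scale(column):
    """Scale a list of values to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


def softmax(logits):
    """Convert raw output-layer scores into probabilities that sum to 1."""
    m = max(logits)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(one_hot_target, probs, eps=1e-12):
    """Loss for one sample: -sum_j t_j * log(p_j)."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot_target, probs))


pouring_temp_c = [1380.0, 1400.0, 1420.0, 1440.0]   # hypothetical values
scaled = min_max_scale(pouring_temp_c)              # 0.0, 1/3, 2/3, 1.0
probs = softmax([2.0, 0.5, -1.0])                   # hypothetical logits
loss = categorical_cross_entropy([1, 0, 0], probs)  # true class: non-defective

print("scaled:", [round(v, 3) for v in scaled])
print("probabilities:", [round(p, 3) for p in probs])
print("loss:", round(loss, 4))
```

With a one-hot target, the loss reduces to the negative log of the probability assigned to the true class, so confident correct predictions drive it toward zero.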

Results and Validation of the Predictive System

The trained model performed very well in predicting the defect class of the complex casting part. We conducted multiple training runs with varying numbers of iterations (epochs) to check the stability of the results. Across these runs, training accuracy stayed above 99% and test accuracy ranged from 98.90% to 99.61%. Because the test accuracy stays close to the training accuracy, the results indicate that the model generalized to the held-out data rather than simply memorizing the training data (overfitting).

Training Iterations (Epochs)	Training Set Accuracy	Training Set Loss	Test Set Accuracy	Test Set Loss
1,000	99.28%	0.0342	99.35%	0.0371
1,100	99.45%	0.0297	98.90%	0.0428
1,200	99.32%	0.0318	99.61%	0.0297
1,300	99.42%	0.0312	99.16%	0.0307
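
The summary statistics of the table above can be recomputed with a few lines of Python. The tuple layout and variable names are choices made for this sketch; the numbers are copied directly from the table.

```python
from statistics import mean

# (epochs, train_acc_%, train_loss, test_acc_%, test_loss), copied from the table.
runs = [
    (1000, 99.28, 0.0342, 99.35, 0.0371),
    (1100, 99.45, 0.0297, 98.90, 0.0428),
    (1200, 99.32, 0.0318, 99.61, 0.0297),
    (1300, 99.42, 0.0312, 99.16, 0.0307),
]

mean_train_acc = mean(r[1] for r in runs)
mean_test_acc = mean(r[3] for r in runs)
min_test_acc = min(r[3] for r in runs)
# Largest absolute difference between training and test accuracy in any run.
largest_gap = max(abs(r[1] - r[3]) for r in runs)

print(f"mean training accuracy: {mean_train_acc:.3f}%")
print(f"mean test accuracy:     {mean_test_acc:.3f}%")
print(f"lowest test accuracy:   {min_test_acc:.2f}%")
print(f"largest train/test gap: {largest_gap:.2f} percentage points")
```

Reporting the lowest test accuracy and the largest train/test gap alongside the means gives a fuller picture of run-to-run variation.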

The key observations from the results are:

High Predictive Accuracy: The average test accuracy across the four runs was approximately 99.26% (99.37% on the training set), well above the initial industrial target of 95%. This indicates that the model can reliably predict the quality outcome for a casting part.
Effective Generalization: The close alignment between training and test accuracy, along with the low and stable loss values on the test set, indicates that the model learned the underlying patterns governing defect formation rather than noise in the data.
Resolution Beyond Binary Classification: The model successfully performs multi-class prediction. It does not merely state if a casting part will be defective, but predicts the specific type of defect (gas pore or sand hole). This granular information is vastly more actionable for process engineers, as the root causes for these defects differ.

The convergence curves for accuracy and loss were smooth and monotonic, showing no signs of instability or overfitting during the training process. This supports the effectiveness of both the SMOTE-based data preparation and the chosen neural network architecture.

Discussion: Implications and Industrial Integration

The success of this methodology has profound implications for the foundry industry. By transitioning from defect detection to defect prediction, manufacturers can adopt a truly proactive quality assurance strategy. The practical deployment of such a model could work as follows: before initiating a production run for a critical casting part, the planned process parameters are fed into the trained model. The model provides a probabilistic forecast of the resulting quality. If a high probability of a specific defect is predicted, engineers can adjust the relevant parameters—such as pouring temperature, sand composition, or core strength—in a virtual loop before any metal is poured, thereby preventing the creation of a flawed casting part and saving substantial costs.
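
The decision step in this workflow might look like the sketch below. It assumes the model's three softmax outputs are already available for a planned parameter set; the function name `review_forecast` and the 0.30 threshold are hypothetical and would need to be set by the plant's own quality policy.

```python
CLASSES = ("non_defective", "gas_pore", "sand_hole")


def review_forecast(probabilities, defect_threshold=0.30):
    """Turn a three-class probability forecast into a recommendation.

    `defect_threshold` is a hypothetical, plant-specific setting.
    """
    forecast = dict(zip(CLASSES, probabilities))
    # Keep only defect classes whose predicted probability is too high.
    flagged = {
        name: p
        for name, p in forecast.items()
        if name != "non_defective" and p >= defect_threshold
    }
    if not flagged:
        return "proceed", None
    most_likely_defect = max(flagged, key=flagged.get)
    return "adjust_parameters", most_likely_defect


# Hypothetical model outputs for two planned parameter sets.
print(review_forecast([0.95, 0.03, 0.02]))  # ('proceed', None)
print(review_forecast([0.40, 0.10, 0.50]))  # ('adjust_parameters', 'sand_hole')
```

Returning the defect type together with the recommendation lets engineers focus on the parameters associated with that defect before the next forecast.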

Furthermore, the model acts as a knowledge discovery tool. By employing techniques like feature importance analysis or sensitivity analysis, one can interrogate the model to understand which input parameters most significantly influence the probability of a gas pore or a sand hole in the final casting part. This insight guides focused process improvement efforts and deepens the fundamental understanding of the casting process for that specific component.
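
One simple form of sensitivity analysis is to raise each input slightly and record how a class probability changes. The sketch below illustrates the idea on a toy stand-in for the trained network: `toy_predict`, its weights, and the three feature names are invented for the example and carry no physical meaning.

```python
import math


def toy_predict(features):
    """Stand-in for the trained model: returns [p_ok, p_gas_pore, p_sand_hole].

    The weights are invented for illustration only.
    """
    sand_moisture, pouring_temp, core_strength = features
    logits = [
        1.0 + 2.0 * core_strength,                 # non-defective
        3.0 * sand_moisture - 1.0 * pouring_temp,  # gas pore
        1.5 - 2.5 * core_strength,                 # sand hole
    ]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sensitivity(predict, baseline, class_index, delta=0.05):
    """Change in one class probability when each feature is raised by delta."""
    base_p = predict(baseline)[class_index]
    effects = []
    for i in range(len(baseline)):
        perturbed = list(baseline)
        perturbed[i] += delta  # one-at-a-time perturbation
        effects.append(predict(perturbed)[class_index] - base_p)
    return effects


names = ["sand_moisture", "pouring_temp", "core_strength"]
baseline = [0.5, 0.5, 0.5]  # normalized feature values
gas_effects = sensitivity(toy_predict, baseline, class_index=1)
ranking = sorted(zip(names, gas_effects), key=lambda item: abs(item[1]), reverse=True)
for name, effect in ranking:
    print(f"{name}: {effect:+.4f}")
```

One-at-a-time perturbation ignores interactions between parameters, so it is best treated as a first screening step rather than a full explanation of the model.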

The core contribution demonstrated here is the combination of data synthesis and machine learning to address a fundamental data scarcity problem in industrial AI. The SMOTE algorithm was instrumental in creating a viable dataset from which a high-performance model could be learned. This approach is not limited to cylinder blocks; in principle it applies to any manufacturing domain where quality data is imbalanced, be it other types of complex casting parts, forgings, welded assemblies, or composite materials.

Future Directions and Enhancements

While the current results are highly promising, several avenues exist for further enhancement and research:

Advanced Data Synthesis: Exploring more recent variations of SMOTE (e.g., Borderline-SMOTE, ADASYN) or generative adversarial networks (GANs) could potentially create even more realistic and informative synthetic samples for defective casting part classes.
Model Architecture Exploration: Testing other machine learning architectures, such as Gradient Boosting Machines (XGBoost, LightGBM) or different neural network topologies (e.g., with dropout, batch normalization), could yield marginal improvements or better computational efficiency.
Incorporating Temporal and Image Data: The current model uses scalar process parameters. Future systems could integrate time-series data from sensors during pouring/cooling or even pre-process images of sand molds/cores to provide a more comprehensive predictive picture for the casting part.
Real-time Prediction and Closed-loop Control: The ultimate goal is to integrate the predictive model into a real-time monitoring and control system. As sensor data streams in during production, the model could provide continuous quality forecasts, enabling dynamic process adjustments to “steer” the production of each individual casting part towards a defect-free outcome.

Conclusion

This research has successfully established a robust framework for predicting defects in complex casting parts by tackling the pervasive problem of imbalanced industrial data. We demonstrated that the Synthetic Minority Over-sampling Technique (SMOTE) is highly effective in creating a balanced dataset necessary for training an unbiased and accurate classifier. The subsequent application of an Artificial Neural Network model yielded a prediction system with an average accuracy exceeding 99%, capable of distinguishing not only between defective and sound castings but also between specific defect types such as gas pores and sand holes. This work underscores the transformative potential of combining data-centric techniques like synthetic data generation with powerful machine learning algorithms to solve real-world manufacturing challenges. The developed methodology provides a scalable and effective blueprint for implementing proactive quality prediction systems, paving the way for smarter, more efficient, and more reliable production of critical metal components across the industry. The ability to foresee and prevent defects in a casting part before it is even made represents a significant leap forward in the pursuit of zero-defect manufacturing.