Quality Analysis of Casting Part Defects Based on Sand Casting Production Data

The sand casting process is characterized by its complexity, involving numerous interacting process parameters. This intricate interplay makes it exceptionally challenging to control the final quality of the casting part, often leading to high scrap rates. This work proposes a data-driven analytical methodology leveraging big data models to reduce defects and enhance production efficiency. By systematically collecting and analyzing historical production data, we aim to uncover hidden relationships between process parameters and defect formation, ultimately building predictive and prescriptive models for improved process control.

The core challenge in sand casting quality control stems from the multivariate and nonlinear nature of the process. Traditional trial-and-error methods or reliance on isolated process control software can be inefficient and may not capture the complex synergies between parameters. A data-driven approach, which mines historical records for patterns and correlations, offers a powerful complementary strategy. This methodology involves constructing a comprehensive dataset of process parameters and corresponding defect outcomes for a specific casting part, followed by rigorous data preprocessing, exploratory analysis, model building, and finally, deploying the model for prediction and optimization.

1. Construction of the Casting Part Quality Dataset

The study focuses on a specific, high-volume casting part: a steering bridge component for forklifts. This casting part, with dimensions of 840 mm × 224 mm × 370 mm, an average wall thickness of 27 mm, and a mass of 51 kg, was selected due to its structural complexity and susceptibility to recurring defects. A key aspect of modern foundry practice is the single-piece management of casting parts, enabling precise tracking of process parameters for each individual unit. This forms the foundation for building a high-fidelity dataset.

Data was collected over a sustained production period, capturing 18 critical process parameters for each manufacturing cycle of the casting part. The parameters span various stages of the process, as summarized in Table 1.


Table 1. Statistical Characteristics of Sand Casting Process Parameters for the Steering Bridge Casting Part
Process Parameter	Max	Min	Mean	Variance
Pouring Temperature (°C)	1415	1385.0	1401.4	6.22
Carbon Content, w(C) (%)	3.85	3.61	3.76	0.056
Silicon Content, w(Si) (%)	2.92	2.60	2.71	0.053
Manganese Content, w(Mn) (%)	0.66	0.38	0.516	0.050
Phosphorus Content, w(P) (%)	0.047	0.013	0.027	0.005
Sulfur Content, w(S) (%)	0.018	0.006	0.012	0.002
Magnesium Content, w(Mg) (%)	0.057	0.034	0.045	0.005
Aluminum Content, w(Al) (%)	0.04	0.017	0.025	0.004
Pouring Weight (kg)	145	128.0	134.9	2.16
Pouring Time (s)	27.2	11.9	17.3	2.01
Inoculant Amount (g)	92	24.0	49.6	9.69
Compactness (%)	48.82	35.07	39.82	1.28
Shear Strength (kPa)	60	2.0	5.03	5.76
Sand Mill Temperature (°C)	48.8	33.4	41.27	2.73
Sand Mill Moisture (%)	2.94	1.38	1.99	0.195
Bentonite Content (%)	58.5	12.5	23.01	1.89
Clay Ratio (%)	13.9	9.8	11.85	0.62
New Sand Content (%)	40	0.0	10.64	12.49

The corresponding quality outcome for each casting part was recorded, categorizing it as either sound or defective, with the defect type specified. Four major defect types were considered: Gas Porosity, Sand Inclusion, Cold Shut, and Shrinkage. The initial dataset comprised 6,390 samples related to the production of this casting part. The class distribution was imbalanced, with 5,202 sound samples and 1,188 defective samples (400 Gas Porosity, 359 Sand Inclusion, 273 Cold Shut, 148 Shrinkage). To mitigate potential bias from severe imbalance during model training, the sound samples were randomly down-sampled to create a more balanced working dataset.
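The down-sampling step can be sketched as follows. This is a minimal illustration, not the authors' procedure: the source does not state how many sound samples were kept, so the `target` of 400 and the helper name `downsample_majority` are assumptions made here.

```python
import random
from collections import Counter

def downsample_majority(labels, majority, target, seed=0):
    """Return sorted indices that keep every minority sample and a random
    subset of `target` samples from the majority class."""
    rng = random.Random(seed)
    majority_idx = [i for i, y in enumerate(labels) if y == majority]
    minority_idx = [i for i, y in enumerate(labels) if y != majority]
    # Sample without replacement; never ask for more than is available.
    kept = rng.sample(majority_idx, min(target, len(majority_idx)))
    return sorted(kept + minority_idx)

# Per-class counts as listed in the text. The target of 400 sound samples is
# an assumption, chosen to match the largest defect class.
labels = (["Sound"] * 5202 + ["Gas Porosity"] * 400 + ["Sand Inclusion"] * 359
          + ["Cold Shut"] * 273 + ["Shrinkage"] * 148)
idx = downsample_majority(labels, majority="Sound", target=400)
balanced_counts = Counter(labels[i] for i in idx)
print(balanced_counts)
```

Fixing the random seed keeps the retained subset reproducible, which matters when later model comparisons are run on the same working dataset.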

2. Data Preprocessing and Exploratory Analysis

2.1. Data Cleaning and Outlier Removal

Raw industrial data often contains noise, missing values, and outliers due to sensor malfunctions or manual entry errors. A robust cleaning process is essential. The 3σ (three-sigma) rule was applied to features with distributions approximating normality. For other features, boxplot analysis based on interquartile ranges (IQR) was used. Observations falling outside the range $$[Q1 - 1.5 \times IQR,\; Q3 + 1.5 \times IQR]$$ were considered outliers, where Q1 and Q3 are the first and third quartiles, and $$IQR = Q3 - Q1$$. This process identified and removed anomalous records from several parameters, resulting in a cleansed dataset of 6,276 samples for model development.
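Both rules can be sketched with the Python standard library. The readings below are hypothetical, and the quartile interpolation method (`"inclusive"`) is an assumption, since the source does not say how quartiles were computed.

```python
import statistics

def three_sigma_mask(values):
    """True where a value lies within mean +/- 3 standard deviations."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [abs(v - mu) <= 3 * sigma for v in values]

def iqr_mask(values):
    """True where a value lies within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [q1 - 1.5 * iqr <= v <= q3 + 1.5 * iqr for v in values]

# Hypothetical shear-strength readings (kPa) with one implausible entry.
shear = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0, 60.0]
keep = iqr_mask(shear)
cleaned = [v for v, ok in zip(shear, keep) if ok]
print(cleaned)
```

On small samples a single extreme value inflates the standard deviation enough to hide itself from the 3σ rule, which is one reason a quartile-based rule is a reasonable fallback for skewed or heavy-tailed features.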

2.2. Correlation Analysis of Process Parameters

Understanding the linear relationships between the 18 process parameters helps identify redundancy. Pearson’s correlation coefficient matrix was calculated. The correlation coefficient between two parameters $$X_i$$ and $$X_j$$ is given by:
$$r_{ij} = \frac{\sum_{k=1}^{n} (x_{ik} - \bar{X}_i)(x_{jk} - \bar{X}_j)}{\sqrt{\sum_{k=1}^{n} (x_{ik} - \bar{X}_i)^2 \sum_{k=1}^{n} (x_{jk} - \bar{X}_j)^2}}$$
where $$n$$ is the number of samples. The resulting heatmap revealed expected correlations, such as the positive correlation between impurity elements (P, S) likely stemming from recycled charge materials, and the negative correlation between sand mill temperature and moisture. However, no parameters were found to be highly redundant, suggesting each carries unique information relevant to the casting part quality.
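The coefficient above translates directly into code. The sketch below mirrors the formula term by term; the column names and readings are hypothetical and are not taken from the production dataset.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient r between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """columns: dict of name -> list of values. Returns a nested dict of r values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical readings for two sand-system parameters.
data = {
    "sand_temp_C": [36.0, 38.5, 41.0, 43.5, 46.0],
    "sand_moisture_pct": [2.4, 2.2, 2.0, 1.9, 1.6],
}
r = correlation_matrix(data)
print(round(r["sand_temp_C"]["sand_moisture_pct"], 3))
```

A strongly negative value for this pair would correspond to the temperature and moisture relationship described above.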

2.3. Dimensionality Reduction via Kernel Principal Component Analysis (KPCA)

To visualize the high-dimensional data and explore underlying structures, Kernel PCA was employed. KPCA implicitly maps the original data into a higher-dimensional feature space via a nonlinear mapping $$\phi(\cdot)$$, defined through a kernel function, in which nonlinear relations can become linear. PCA is then performed in this new space. Assuming the mapped data are centered, the covariance matrix in the feature space is:
$$C = \frac{1}{n} \sum_{i=1}^{n} \phi(\mathbf{x}_i) \phi(\mathbf{x}_i)^T$$
The principal components are found by solving the eigenvalue problem for the kernel matrix $$K$$, where $$K_{ij} = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$$. A radial basis function (RBF) kernel was used. The first three principal components captured 93.2% of the total variance. Visualizing the data in this 3D space (see the linked figure) provided intuitive insights: Gas Porosity defects tended to cluster along high values of the first principal component, Sand Inclusions along the second, and Cold Shuts showed some separation along the third. This visual clustering confirms that the process parameters contain discriminative information for different defect types in the casting part.
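A compact NumPy sketch of RBF kernel PCA is given below. It is an illustration under stated assumptions rather than the authors' implementation: the data are synthetic, the kernel width `gamma = 1/18` is a placeholder (the source does not report it), and the explained-variance ratio is taken relative to all eigenvalues of the centered kernel matrix.

```python
import numpy as np

def rbf_kernel_pca(X, gamma, n_components):
    """Project rows of X onto the leading kernel principal components."""
    # Pairwise squared Euclidean distances, then the RBF kernel matrix K.
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    K = np.exp(-gamma * np.maximum(sq_dists, 0.0))

    # Center K in feature space: K_c = K - 1K - K1 + 1K1, with 1 = ones/n.
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)
    K_c = K - one_n @ K - K @ one_n + one_n @ K @ one_n

    # eigh returns eigenvalues in ascending order; reverse to put largest first.
    eigvals, eigvecs = np.linalg.eigh(K_c)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    eigvals = np.clip(eigvals, 0.0, None)  # remove tiny negative round-off

    # Scores of the training points: unit eigenvector scaled by sqrt(eigenvalue).
    scores = eigvecs[:, :n_components] * np.sqrt(eigvals[:n_components])
    explained = eigvals[:n_components] / eigvals.sum()
    return scores, explained

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 18))             # synthetic stand-in for 18 parameters
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each column
scores, explained = rbf_kernel_pca(X, gamma=1.0 / 18, n_components=3)
print(scores.shape, explained.round(3))
```

The three score columns are what would be plotted in a 3D scatter, colored by defect class, to look for the clustering described above.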

3. Building the Defect Classification Model

The core of the data-driven framework is a supervised learning model that classifies the quality state of a casting part based on its process parameters. Given the tabular nature of the data, ensemble tree-based methods are highly effective. A Random Forest (RF) classifier was selected for its robustness, high accuracy, and inherent feature importance measures.

The Random Forest algorithm constructs multiple decision trees during training. For each tree, a bootstrap sample (bagging) is drawn from the dataset, and at each node split, a random subset of features is considered. The final classification for a new casting part sample is determined by majority vote across all trees. The Gini impurity, often used to select splits, measures the probability of misclassifying a randomly chosen element from the node if it were labeled at random according to the node's class distribution. For a node $$t$$ with class probability distribution $$p(i|t)$$ for class $$i$$, the Gini impurity is:
$$I_G(t) = 1 - \sum_{i=1}^{c} [p(i|t)]^2$$
where $$c$$ is the number of classes. The model was trained and evaluated using 5-fold cross-validation to assess its generalizability.
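To make the impurity formula concrete, the short sketch below evaluates it for a hypothetical node and for a candidate split. The node contents are invented for illustration, and `split_impurity` is a helper name introduced here.

```python
from collections import Counter

def gini_impurity(labels):
    """I_G = 1 - sum_i p_i^2 for the class labels in one node."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(left, right):
    """Size-weighted Gini impurity of the two child nodes of a split."""
    n = len(left) + len(right)
    return (len(left) / n * gini_impurity(left)
            + len(right) / n * gini_impurity(right))

# Hypothetical node with 6 sound parts and 4 gas-porosity parts:
# I_G = 1 - (0.6^2 + 0.4^2), i.e. about 0.48.
node = ["Sound"] * 6 + ["Gas Porosity"] * 4
parent_impurity = gini_impurity(node)
# A split that separates the two classes perfectly leaves pure children.
child_impurity = split_impurity(node[:6], node[6:])
print(parent_impurity, child_impurity)
```

A tree-growing algorithm prefers the split with the largest drop from the parent impurity to the weighted child impurity.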

Model performance was evaluated using standard metrics for multi-class classification. For each defect class $$i$$:

Recall (True Positive Rate, TPR_i): The ability to correctly identify all defective casting parts of type $$i$$.
$$TPR_i = \frac{TP_i}{TP_i + FN_i}$$
Precision (Positive Predictive Value, PPV_i): The correctness of positive predictions for defect type $$i$$.
$$PPV_i = \frac{TP_i}{TP_i + FP_i}$$
F1-Score_i: The harmonic mean of precision and recall.
$$F1_i = 2 \cdot \frac{PPV_i \cdot TPR_i}{PPV_i + TPR_i}$$
Overall Accuracy: The proportion of total casting parts (sound and all defects) correctly classified.

Here, $$TP_i$$ (True Positives) are casting parts correctly predicted as defect $$i$$, $$FP_i$$ (False Positives) are other casting parts incorrectly predicted as defect $$i$$, and $$FN_i$$ (False Negatives) are defect $$i$$ casting parts incorrectly predicted as another class.
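These definitions can be computed directly from a confusion matrix. The sketch below uses a hypothetical three-class matrix, not the results reported in this work, with rows as true classes and columns as predicted classes.

```python
def per_class_metrics(confusion, classes):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    metrics = {}
    for k, name in enumerate(classes):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                 # row total minus TP
        fp = sum(row[k] for row in confusion) - tp  # column total minus TP
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[name] = {"recall": recall, "precision": precision, "f1": f1}
    return metrics

def accuracy(confusion):
    """Fraction of all samples that lie on the diagonal."""
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    return correct / sum(sum(row) for row in confusion)

# Hypothetical confusion matrix for illustration only.
classes = ["Sound", "Gas Porosity", "Cold Shut"]
confusion = [
    [90, 6, 4],
    [3, 45, 2],
    [2, 4, 44],
]
m = per_class_metrics(confusion, classes)
macro_f1 = sum(v["f1"] for v in m.values()) / len(m)
print(m["Gas Porosity"], round(macro_f1, 3), accuracy(confusion))
```

Macro averaging, as used in Table 2, gives each class equal weight regardless of how many samples it contains, so rare defects such as shrinkage count as much as the sound class.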

Table 2 compares the performance of the Random Forest model against other common classifiers. The RF model demonstrated superior performance, achieving the highest accuracy and F1-score, making it the preferred model for this casting part analysis.

Table 2. Performance Comparison of Classification Models for Casting Part Defects
Classification Model	Accuracy (%)	Macro Avg. F1-Score (%)	Macro Avg. Recall (%)	Macro Avg. Precision (%)
Random Forest (RF)	97.1	92.37	94.08	90.72
K-Nearest Neighbors (KNN)	95.8	90.25	92.52	88.08
Support Vector Machine (SVM)	92.3	77.51	74.40	80.90
Neural Network (NN)	94.3	86.84	91.44	82.68

4. Model Results and Interpretability

4.1. Classification Performance Analysis

The confusion matrix for the RF model revealed excellent discriminatory power. The recall rates for all four defect types exceeded 90%, meaning over 90% of actual defective casting parts for each type were correctly identified. This is critical for a foundry’s goal of minimizing the shipment of defective parts. The primary misclassifications occurred between Gas Porosity and Cold Shut defects, and between various defects and the sound class, suggesting some overlapping parameter influence and the inherent difficulty in perfectly separating every instance of this complex casting part.

4.2. Feature Importance Analysis

A key advantage of the Random Forest model is its ability to rank the importance of input features. Importance was calculated as the mean decrease in Gini impurity across all trees in the forest contributed by splits on a given feature. A higher value indicates a greater influence on the model’s ability to separate the classes (defect types). The analysis provides data-driven insight into which process parameters most significantly affect the quality of this specific casting part. The top influential parameters for each defect type are summarized in Table 3.
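One common way to write this quantity is given below. The source does not state the formula, so the notation is an assumption introduced here: $$B$$ is the number of trees, $$T_b$$ the set of internal nodes of tree $$b$$, $$v(t)$$ the feature used to split node $$t$$, $$n_t$$ the number of training samples reaching node $$t$$ out of $$n$$ used to grow the tree, and $$t_L$$, $$t_R$$ the left and right children of $$t$$. For a feature $$f$$:
$$\mathrm{Imp}(f) = \frac{1}{B} \sum_{b=1}^{B} \sum_{t \in T_b:\, v(t) = f} \frac{n_t}{n} \left[ I_G(t) - \frac{n_{t_L}}{n_t} I_G(t_L) - \frac{n_{t_R}}{n_t} I_G(t_R) \right]$$
Each bracketed term is the impurity decrease achieved by one split, and the factor $$n_t / n$$ weights it by the share of samples the split affects, so splits near the root contribute more than splits on small leaf-level subsets.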

Table 3. Top Influential Process Parameters for Different Casting Part Defects (Based on Gini Importance)
Defect Type	Rank 1 (Most Important)	Rank 2	Rank 3	Practical Interpretation
Gas Porosity	Inoculant Amount	Pouring Temperature	Silicon Content	Excessive inoculant and high temperature promote gas nucleation/pickup.
Sand Inclusion	New Sand Content	Sand Mill Temperature	Compactness	Sand-related parameters dominate; poor sand system control leads to inclusion defects.
Cold Shut	Carbon Content	Pouring Temperature	Pouring Time	Fluidity-related parameters; low pouring temperature, high carbon equivalent (C.E.), or short fill time cause misruns.
Shrinkage	Magnesium Content	Pouring Weight	Bentonite Content	High residual Mg increases shrinkage tendency; mold rigidity (bentonite) is also key.

These findings align well with metallurgical and casting principles for this ductile iron casting part, validating the model. For instance, the strong link between inoculant amount, pouring temperature, and gas porosity is well-known. The prominence of sand-related parameters for sand inclusions is logical. The high importance of magnesium for shrinkage is critical for ductile iron, where excessive Mg severely increases shrinkage propensity. This analysis transforms the model from a “black box” into a decision-support tool, guiding process engineers on which parameters to scrutinize when a specific defect plagues the casting part.

5. Online Quality Prediction and Process Parameter Optimization

The trained model can be deployed in various practical scenarios to actively improve the production of the casting part.

5.1. Real-time Quality Prediction

The RF model can serve as an online soft sensor. Once all 18 process parameters for a given mold are logged (post-casting), the model predicts the defect probability distribution. The output is a vector of scores (from 0 to 1) for each class, representing the confidence level from the ensemble of trees. For example, a sample might yield: Sound: 0.03, Gas Porosity: 0.85, Sand Inclusion: 0.04, Cold Shut: 0.08, Shrinkage: 0.00. This indicates a high-risk casting part likely to have gas porosity, prompting focused inspection.
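A minimal sketch of how such a score vector could be formed and acted on is shown below, assuming each tree casts one hard vote. The votes are hypothetical and chosen to reproduce the example above, and the 0.5 inspection threshold is an assumption; some implementations average per-tree class probabilities instead of counting votes.

```python
from collections import Counter

CLASSES = ["Sound", "Gas Porosity", "Sand Inclusion", "Cold Shut", "Shrinkage"]

def vote_scores(tree_votes):
    """Fraction of trees voting for each class (scores sum to 1)."""
    counts = Counter(tree_votes)
    return {c: counts.get(c, 0) / len(tree_votes) for c in CLASSES}

def flag_for_inspection(scores, threshold=0.5):
    """Return the most likely defect if its score reaches the threshold, else None."""
    defect, score = max(
        ((c, s) for c, s in scores.items() if c != "Sound"),
        key=lambda cs: cs[1],
    )
    return defect if score >= threshold else None

# Hypothetical votes from a 100-tree forest.
votes = (["Sound"] * 3 + ["Gas Porosity"] * 85 + ["Sand Inclusion"] * 4
         + ["Cold Shut"] * 8)
scores = vote_scores(votes)
flagged = flag_for_inspection(scores)
print(scores, flagged)
```

In practice the threshold would be tuned against the cost of a missed defect versus the cost of an unnecessary inspection.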

More powerfully, prediction can occur during the process. If a subset of parameters is fixed (e.g., sand parameters, chemistry), and others are yet to be determined (e.g., pouring temperature, inoculant amount), the model can run Monte Carlo simulations. The undetermined parameters are randomly sampled from their historical distributions, and the model is evaluated thousands of times. The resulting probability distribution for defects provides a risk assessment for the current planned settings for the upcoming casting part. For instance, if the planned inoculant amount is set to a high value of 70 g, the simulation might predict an 85% probability of gas porosity, signaling an immediate need to adjust the plan.

5.2. Process Parameter Optimization via Monte Carlo Simulation

We can extend this idea to actively optimize process windows. Suppose we want to find the optimal pouring temperature for this casting part, given that all other parameters are fixed at their nominal values and the inoculant amount is set high at 70 g. We define a search space for the mean pouring temperature, $$\mu_T$$, from 1385°C to 1415°C. For each candidate $$\mu_T$$, we assume the actual temperature follows a normal distribution $$N(\mu_T, \sigma_T^2)$$ with $$\sigma_T = 5$$°C. We then perform a Monte Carlo simulation:

For $$m=1$$ to $$M$$ (e.g., $$M=50,000$$) trials:
- Sample pouring temperature: $$T^{(m)} \sim N(\mu_T, 5^2)$$.
- Construct the full feature vector with $$T^{(m)}$$ and other fixed parameters.
- Input the vector to the trained RF model to get the defect probability vector $$\mathbf{p}^{(m)}$$.
Compute the average probability for each defect over all $$M$$ trials:
$$ \bar{P}(\text{Defect}_i | \mu_T) = \frac{1}{M} \sum_{m=1}^{M} p_i^{(m)} $$

By plotting $$\bar{P}$$ for each defect against $$\mu_T$$, we identify the temperature that minimizes the overall risk or the probability of a specific defect. The simulation might reveal that while a lower temperature reduces gas porosity risk for this casting part, it increases the risk of cold shuts. The optimal temperature is the one that best balances these competing risks, perhaps minimizing the weighted sum of defect probabilities based on their cost or severity.
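The sweep over $$\mu_T$$ can be sketched as follows. Everything model-related here is a placeholder: `stub_predict_proba` is a hypothetical stand-in for the trained RF model with invented response curves, only three classes are shown, the trial count is reduced to 5,000 to keep the example fast, and the equal defect weights are an assumption.

```python
import math
import random

CLASSES = ["Sound", "Gas Porosity", "Cold Shut"]

def stub_predict_proba(temp_c):
    """Hypothetical stand-in for the trained RF model's probability output.
    Gas-porosity risk rises with temperature and cold-shut risk falls."""
    gas = 0.6 / (1.0 + math.exp(-(temp_c - 1405.0) / 4.0))
    cold = 0.6 / (1.0 + math.exp((temp_c - 1395.0) / 4.0))
    return [1.0 - gas - cold, gas, cold]

def mean_defect_probs(mu_t, sigma_t=5.0, trials=5000, seed=0):
    """Monte Carlo estimate of the mean class probabilities for one mu_T."""
    rng = random.Random(seed)
    sums = [0.0] * len(CLASSES)
    for _ in range(trials):
        temp = rng.gauss(mu_t, sigma_t)  # T ~ N(mu_T, sigma_T^2)
        for i, p in enumerate(stub_predict_proba(temp)):
            sums[i] += p
    return [s / trials for s in sums]

def best_temperature(candidates, weights):
    """Pick the mu_T with the lowest weighted sum of defect probabilities."""
    def cost(mu_t):
        probs = mean_defect_probs(mu_t)
        return sum(w * p for w, p in zip(weights, probs))
    return min(candidates, key=cost)

candidates = list(range(1385, 1416, 5))  # search range from the text, 5 °C steps
weights = [0.0, 1.0, 1.0]                # assumed: both defects cost the same
best = best_temperature(candidates, weights)
print(best, [round(p, 3) for p in mean_defect_probs(best)])
```

Reusing the same seed for every candidate means all candidates are compared on the same random draws, which reduces noise in the comparison. Replacing `weights` with scrap or rework costs per defect type would turn the selection into the cost-weighted trade-off described above.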

6. Conclusion

This work successfully established a comprehensive data-driven framework for analyzing and improving the quality of a specific sand-cast steering bridge component. By integrating data collection, preprocessing, machine learning modeling, and simulation-based optimization, the approach moves beyond traditional quality control methods. The key findings are:

Effective Defect Prediction: A Random Forest classification model was built that can predict defects in the casting part with high recall (>90% for all types), providing a powerful tool for early warning and targeted inspection.
Actionable Process Insights: Feature importance analysis derived from the model quantified the impact of various process parameters on specific defects. This data-driven ranking aligns with and reinforces foundry engineering knowledge, offering clear guidance for process adjustments when defects occur.
Proactive Quality Management: The framework enables two crucial applications: (a) real-time quality prediction for individual casting parts, both post-process and in-process, and (b) simulation-based optimization of key process parameters like pouring temperature to find operating windows that minimize defect risk.

The methodology demonstrates the significant potential of leveraging production big data to transform the manufacturing of complex casting parts from a reactive, experience-driven practice to a proactive, data-informed science. Future work will focus on integrating this model into a real-time dashboard for plant floor engineers, expanding the approach to other critical casting parts, and exploring advanced deep learning models for even more nuanced pattern recognition in the data.