We have a dataset of 1030 data of concrete, downloaded from UCI repository https://archive.ics.uci.edu/dataset/165/concrete+compressive+strength . It collects 8 input features and 1 target feature:
Cement: -- quantitative -- kg in a m3 mixture -- Input Variable
Blast Furnace Slag -- quantitative -- kg in a m3 mixture -- Input Variable
Fly Ash -- quantitative -- kg in a m3 mixture -- Input Variable
Water -- quantitative -- kg in a m3 mixture -- Input Variable
Superplasticizer -- quantitative -- kg in a m3 mixture -- Input Variable
Coarse Aggregate -- quantitative -- kg in a m3 mixture -- Input Variable
Fine Aggregate -- quantitative -- kg in a m3 mixture -- Input Variable
Age -- quantitative -- Day (1~365) -- Input Variable
Concrete compressive strength -- quantitative -- MPa -- Output Variable
The focus of this tutorial (Class of "Machine Learning in Civil and Architectural Engineering", Aarhus University, Fall 2023) is:
Exploratory Data Analysis (EDA) of the dataset:
Linear Regression
kNN
Model comparison
At first, we import the main libraries that we will use:
IMPORT MODULES
1 import numpy as np 2 import pandas as pd 3 import matplotlib.pyplot as plt 4 from OpenAIUQ import Stat as sta 5 from OpenAIUQ import knn_regression as knnreg 6 from OpenAIUQ import ML_regression as mlreg 7 from sklearn.preprocessing import StandardScaler 8 from sklearn.linear_model import LinearRegression 9 from sklearn.neighbors import KNeighborsRegressor
In Lines 4-6 we import from OpenAIUQ the modules data_analysis (prefix data), knn_regressor (prefix knnreg) and ML_regression (prefix mlreg).
LOAD THE DATASET (1030 data of concrete)
1 df = pd.read_excel("Concrete_Data.xls") 2 dataset=np.array(df) 3 del df
4 features_all = ['cement','blast','ash','water','plastic', 'coarse','fine','age','strength'] 5 dataset=sta.data2(dataset,features=features_all)
In Line 1 we load the dataset (1030 data of concrete density), type=dataframe. Later we convert first into a 2darray (Line2) and later (Lines 4-5) into the object type=data2 of Stat.
1 features=features_all[:-1] 2 feature_target=features_all[-1] 3 X=np.array(dataset.dataset_df[features]) 4 y=np.array(dataset.dataset_df[feature_target]) 5 reg=mlreg.reg(X,y,features=features)
In Lines 1-4 we build the matrix of the input features, and the target vector y, both data type=array. In Line 5 we introduce the object reg that will be used for model selection of different machine learning models.
REMOVE TEST SET
1 test_ratio = 0.25 #percent of test set, i.e. test set=25% #random_state=42 (or other number different from None) provides deterministic #sequence 2 Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=test_ratio, random_state=42)
In Line 2 we split the dataset in training set (Xtrain and ytrain) and test set (Xtest and ytest). The test set is chosen equal to the 25% of the full training data (denoted by parameter test ratio, Line 1). We chose random state=42 for replicability of the results. Xtrain is an array (772,8) and ytrain is array (772,). Correspondingly Xtest is (258,8), while ytest is (258,).
EXPLORATORY DATA ANALYSIS
1 concrete=data.data2(np.c_[Xtrain,ytrain],features=features_all)
2 concrete.disp_summary(output='no') 3 concrete.summary
4 concrete.plot(pair='yes') 5 concrete_pairplot=concrete.fig
In Line 1 we build the object concrete (type=data2) over the training data. The main statistical properties of the data training are displayed through the method disp.summary (Lines 2-3). In Lines 4-5 we build the figure pairplot of seaborn over the training data. The attribute concrete.summary and the figure concrete_pairplot provide a main picture of the training data.
From concrete.summary it is seen that the concrete strength ranges from min=2.33 MPa to max=82.59 MPa. This implies that a good range of analysis for the strength can be x=[0, 90] MPa. The sample mean is m=35.81 Mpa, with coefficient of variation v=46.61% showing high degree degree of variability. It is not fully clear the reason of this great dispersion around the mean value. However, concrete_pairplot shows that the strength values have been measured:
in different time slots ( shown from the data aligned vertically in the scatter plot age-strength)
in different conditions, as shown from the histograms blast, ash, plastic (where there several zero entries in the features) and scatter plot blast-strength, ash-strength, plastic-strength.
Therefore the samples are not IID (independent and identically distributed). The sample skewness is g1=0.41 which implies some asymmetry toward the right tail, while the sample kurtosis g2=2.68 shows that extreme values are not likely.
1 cement=concrete.d[0] 2 blast=concrete.d[1] 3 ash=concrete.d[2] 4 water=concrete.d[3] 5 plastic=concrete.d[4] 6 coarse=concrete.d[5] 7 fine=concrete.d[6] 8 age=concrete.d[7] 9 strength=concrete.d[8]
In Lines 1-9 we build objects type=data1 from concrete data for more detailed statistical description of the features. It may be for example of interest to analyze the probabilistic distribution of the target.
TARGET TRAINING DATA
1 strength.disp_summary() 2 strength.plot_hist(bins=np.arange(0,85,5)) 3 strength.ax.set_xlabel('strength [MPa]'); 4 strength_hist=strength.fig
5 print(strength.mean)
6 print(strength.std)
In Line 2 we print the values of sample mean and standard deviation of the target (coinciding with the strength, see Line 1), i.e. m=35.90 MPa and s=16.78 MPa), while in Line 3 we plot the histogram with classes ranging from f=0 MPa to f=80 MPa, bin size=5 MPa.
It is also of interest to analyze the probability distribution of the target.
1 yrv_train=sta.dist('Weibull') 2 yrv_train.MLEfit(strength.data) 3 yrv_train.set_domain(xx=[0,90]) 4 yrv_train.prob_plot_w4(strength.data)
In Line 1 we fit the data of the targer with Weibull distribution. In Line 2 we find the parameters of the distribution through the Maximum Likelihood Estimation (MLE), while in Line 4 we plot the outcomes of the model fit:
top left: probability plot, underlining the central part of the distribution
top right: q-q plot, underling the tails of the distribution
bottom left: Weibull pdf compared with histograms
bottom right: Weibull cdf compared with empirical cdf
It is seen clearly in this case the Weibull distribution is a very good fit for the target training data.
CORRELATIONS
1 for i in range(len(features)): 2 concrete.plot(x0=i,y0=8,regression='yes',hist='yes')
3 concrete.plot_corr(heatmap='yes',output='no') 4 concrete.R_df.round(3)
In Lines 1-2 we plot the jointplot of seaborn for each pair feature-target, including scatter plot, histograms and regression lines. The method concrete_plot is within a loop in order to consider the scatter plot between the strength and each input feature. Here x0=i means feature i (i.e. columns i+1) as x-axis, and y0=8 means feature 9 (i.e. strength)
These plots are integrated with the values of the matrix of correlation (Lines 3-4).
Although the coefficient of correlation defines only a linear relationship between target and input features, the trend can well be detected. It is seen so that the strength increase with the cement content, while decreases with the water content, as expected from the physics of the problem. High correlation values of the strength are with cement content (positive, +0.50), water (negative, -0.29). It appears that linear regression does not describe properly the correlation with age, since this is clearly a time-dependent phenomenon.
NORMALIZATION
#Normalization #mean and variance are evaluated over the training set #the transformation is pursued over training and test set 1 scaler = StandardScaler() 2 scaler.fit(Xtrain) 3 Xtrain = scaler.transform(Xtrain) 4 Xtest = scaler.transform(Xtest) 5 concrete_norm=sta.data2(np.c_[Xtrain,ytrain],features=features_all) 6 concrete_norm.disp_summary(output='no') 7 concrete_norm.summary.round(3)
Linear Regression and kNN regression are ML models grounded on the definition of distance (between points). Therefore, to take into account the contributions of the different features, normalization is needed. In Lines 1-2 normalization is developed using the data of the training set (we cannot use the data of the test set, which is unknown for definition). In Line 3 and Line 4 the training set and the test set are normalized, respectively. To show the normalization effects, in Lines 6-7 the statistics of the normalized training data are shown. It is seen that all the features have zero mean and unit standard deviation.
VALIDATION SET
#======================================= #Build VALIDATION SET #======================================= 1 val_ratio = 0.33 #percent of validation set with respect to training set #i.e. 0.333*0.75=0.25, validation set=25% #It is chosen a given validation set 2 Xtrain_small, Xval, ytrain_small, yval = train_test_split(Xtrain, ytrain, test_size=val_ratio, random_state=42) #Xtrain_small: small training set (50% data) #Xval: validation set (25% data) #ytrain_small: target, small training set #yval: target, validation set
In Line 2 we split the training set in small training set and validation set (Xtrain_small and ytrain_small) and validation set (Xval and yval). The validation set is chosen equal to the 25% of the full training data (denoted by parameter validation ratio, Line 1, since 0.33*0.75=0.25). We chose random state=42 for replicability of the results. Xtrain_small is an array (517,8) and ytrain_small is array (517,). Correspondingly Xval is (255,8), while yval is (255,).
MODEL 1, LINEAR REGRESSION: TRAINING-VALIDATION SET
1 lin=LinearRegression() 2 lin.fit(Xtrain_small, ytrain_small)
In Line 1 we introduce the object lin as a LinearRegression of sklearn. In Line 2 we evaluate the parameters trained on the small training set (50% data). A way to check the prediction capabilities of the model is the learning curve. It states how the learning performance of the machine learning model increases as we give it more training examples.
#................................. #%%learning curve #................................. 1 learning_points=[1,Xtrain_small.shape[0]] 2 reg.plot_learning_curve(lin,Xtrain_small, Xval, ytrain_small, yval, learning_points) 3 reg.ax.set_xlabel('samples') 4 reg.ax.set_ylim(0,20) 5 reg.ax.set_title('linear') #model #Xtrain_i=small training set (50%) #Xval_i=validation set (25%) #ytrain_i=actual target, small training set #yval_i=actual target, validation set #learning_points=points of the small training set
In Line 1 we introduce the number of learning points. They are the number of points used for training. They range from 1 to the number of data point of the training data (here 517). In Line 2 we plot the learning curve.
The picture shows the good tradeoff of the linear regression in terms of bias-variance, since after only 150-200 training data the validation RMSE is converging to the training RMSE. It is however of interest to understand the behavior of the predictive model, to check if it satisfies the physical behavior of the target feature (concrete strength, in this case)
1 reg.plot_train_test_w2(lin,Xtrain_small,Xval,ytrain_small,yval,x0=0) 2 reg.ax1.set_xlabel('cement'); 3 reg.ax2.set_xlabel('cement');
In Lines 1-3 we plot the scatter plot of the training set (50% data) and validation set (25% data) together with the regression lines. The parameter x0=0 means that we are considering the feature 1 (column indexed by 0) as x-axis of the plot.
The RMSE_train=10.61 while RMSE_val=10.30, this shows that the linear regression has good tradeoff bias-variance. Although RMSE_val=10.30 MPa is high value in absolute terms (also considering that the sample mean of the strength is m=35.81 MPa), we note that the sample standard deviation is s=16.78 MPa. This implies that RMSE is within the range of the standard deviation of the training data. The figures confirm the findings of the learning curve.
A major drawback of the training-validation set is that may be affected by a poor choice of the training/validation set. An alternative in regard is represented by the cross-validation.
With respect to the split training-validation set, the CV has the following advantages:
Avoids bad/good choice of a single training & test set
provides sensitivity of model to data
use data more efficiently
1 reg.cross_validation([lin],Xtrain,ytrain,model_names=['lin'])
In Line 1 we apply cross-validation to the linear regression. Note that it is applied to the whole training set (75% data). For default 10 folds are considered. At the i-th split the fold=i is used as validation set, while the remaining folds are used for training. It is seen that the maximum RMSE is 12.5 MPa (split=3), which is below the standard deviation of the target.
MODEL 2, KNN REGRESSION: TRAINING-VALIDATION SET
1 knn=knnr.knn()
2 knn.grid(Xtrain,ytrain,param_iter=np.arange(1,21,1),seed=42,plot='yes')
3 kopt=knn.k
In Line 1 we instance the object knn the module knnr (knn regression)of OpenAIUQ. In Line 2 we plot a grid (different values of k) through the method grid. The parameter plot='yes' shows the plot, while the optimal k is the attribute knn.k (Line 3). In this pair training/validation set (seed=42) the optimal parameter is k=2.
Of course, changing pair training/validation set, also the hyperparameter k will change.
One strategy for choosing the hyperparameter is running N random split training/validation sets and plotting the histograms of the obtained k.
1 knn.k_hist(Xtrain,ytrain,samples=1000,plot_grid='no')
Following this exploratory analysis for the model selection (choice of the hyperparameter k) we explore two different knn models: knn2 (knn with k=2) and knn4 (knn with k=4)
#-------------------------- #knn - validation set #-------------------------- 1 knn2 = KNeighborsRegressor(n_neighbors=2) 2 knn2.fit(Xtrain_small, ytrain_small) 3 knn4 = KNeighborsRegressor(n_neighbors=4) 4 knn4.fit(Xtrain_small, ytrain_small) 5 reg.rmse_train_test_models([knn2,knn4],Xtrain_small,Xval,ytrain_small,yval, model_names=['knn2','knn4'])
In Line 1 we instance the object knn2 the module KNeighborsRegressor of scikit-learn, choosing k=2. In Line 2 we fit the parameters of the chosen knn regressor. In Lines 3-4 we describe the bject knn4 as a knn regressor with k=4. In Line 5 we print the RMSE errors of the chosen models
| RMSE train | RMSE validation |
knn2 | 5.44 | 9.37 |
knn4 | 7.22 | 9.58 |
The results show that knn4 provides a good tradeoff bias-variance, while knn2 (as expected) presents overfitting. To check the trend, we consider the learning curves of the models.
1 learning_points=[2,Xtrain_small.shape[0]]
2 reg.plot_learning_curve(knn2,Xtrain_small, Xval, ytrain_small, yval, learning_points)
3 reg.ax.set_ylim(0,20)
4 reg.ax.set_xlabel('samples')
5 reg.ax.set_title('knn2')
6 learning_points=[4,Xtrain_small.shape[0]]
7 reg.plot_learning_curve(knn4,Xtrain_small, Xval, ytrain_small, yval, learning_points)
8 reg.ax.set_xlabel('samples')
9 reg.ax.set_title('knn4')
10 reg.ax.set_ylim(0,20)
In Lines 1-5 and Lines 6-10 we plot the learning curves of knn2 and knn4, respectively. Note that for knn4 the minimum number of data points to be considered is 4, since k=4 neighbors. The results of the table are confirmed by the learning curves. The validations performances of the two models are similar, but knn2 is overfitting.
Model building is developed through cross-validation
1 reg.cross_validation([knn2,knn4],Xtrain,ytrain,model_names=['knn2','knn4'])
Results show that except for splits 2 and 7, the model knn4 outperforms the model knn2. So, model knn4 is our choice.
MODEL COMPARISON, LINEAR REGRESSION AND KNN4: TRAINING DATA
1 reg.cross_validation([lin,knn4],Xtrain,ytrain,model_names=['lin','knn4'])
2 reg.rmse_train_test_models([lin,knn4],Xtrain_small,Xval,ytrain_small,yval,model_names=['lin','knn4']) 3 reg.plot_train_test_models_w2([lin,knn4],Xtrain_small,Xval,ytrain_small,yval, points=500,x0=0,model_names=['lin','knn4'])
In Line 1 the two models (linear regression and knn4) are compared through cross-validation.
It is seen that knn4 outperforms linear regression for each split. The RMSE of knn4 is in the range 7-10 Mpa, while linear regression RMSE is around 9-12 This from statistical point of view. But it can be of interest to understand the physical behavior of the two models. Thus, in Line 3 we consider a split training/validation set (seed=42, x0=0 meaning cement/strength)
Both models have clear physical violation beyond the region of the available data.
Linear regression may predict infinite strength for very high values of cement content
knn4 predicts plateau values for cement<100 Kg/m3 and cement>500 kg/m3. Moreover the knn regressor is highly fluctuating in the data region.
MODEL BUILDING, LINEAR REGRESSION AND KNN4: WHOLE TRAINING DATA (75%)
#------------------- #lin - training set #------------------- 1 lin = LinearRegression() 2 lin.fit(Xtrain, ytrain) #------------------- #knn - training set #------------------- 3 knn4 = KNeighborsRegressor(n_neighbors=4) 4 knn4.fit(Xtrain, ytrain)
After the selection of the models (linear regression and knn4) the models are re-trained over the whole training set. See Lines 1-2 for linear regression, and Lines 3-4 for knn4.
1lin_pred=lin.predict(Xtrain) 2 knn4_pred=knn4.predict(Xtrain) 3 lin_pred_d1=sta.data1(lin_pred) 4 knn4_pred_d1=sta.data1(knn4_pred) 5 ytrain_d1=sta.data1(ytrain) 6 features_qq=['ypred','ytrain'] 7concrete.qq_plot_models([lin_pred_d1,knn4_pred_d1],ytrain_d1,model_names=['lin','knn4'],features=features_qq)
In Line 1 and Line2 we evaluate the predictions of models linear and knn4, respectively, evaluated over the whole training set. lin_pred and knn4_pred are array of order (Xtrain,). In Lines 3-4 we convert them into objects data1. In Line 5 we convert the array ytrain into object data1.
It is so possible to apply the method qq_plot_models. Here, the dataset lin_pred_d1 and knn4_pred_d1 are both compared with the training target ytrain_d1. In the q-q plot we consider the empirical distributions of the data, and for each dataset (lin_pred, knn4_pred, ytrain) we evaluate the quantiles for different values of probabiliy, from p=0 to p=1 with steps delta=0.01. Each point of the plot is the pair (quantile_model, quantile_ytrain).
The q-q plot confirms the findings of the cross-validation, whereas knn4 outperforms lin. The q-q plot more specifically shows that knn4 provides better accuracy in the left tail of the distribution, which is of greater importance for the strength.
MODEL SELECTION, LINEAR REGRESSION AND KNN4: TEST SET (25%)
1 features_qq=['ytrain','ytest'] 2 ytrain_d1=sta.data1(ytrain) 3 ytest_d1=sta.data1(ytest) 4 concrete.qq_plot(ytrain_d1,ytest_d1,features=features_qq)
At first we check the coherence train/test set, to see if their data are generated from the same unknown joint distribution. In Lines 2-3 we build the objects ytrain_d1 and ytest_d1 (type data=data1) and in Line 4 we represent the corresponding q-q plot
At least with reference to the target, there is no doubt that the test set can represent a good test for the chosen models trained over the training set.
1 reg.rmse_train_test_models([lin,knn4],Xtrain,Xtest,ytrain,ytest,model_names=['lin','knn4']) 2 reg.plot_train_test_models_w2([lin,knn4],Xtrain,Xtest,ytrain,ytest, points=500,x0=0,model_names=['lin','knn4'])
| RMSE train | RMSE test |
lin | 10.42 | 10.08 |
knn4 | 6.72 | 8.86 |
In Line 1 we evaluate the RMSE of both models, linear regression and knn4. It is seen that they confirm the trend of the cross-validation. However, knn4 is overfitting more than linear regression. In Line 2 we plot the models in the training/set tests to check their physical behavior. Also in this case, the results of the training data are confirmed.
1 lin_pred=lin.predict(Xtest) 2 knn4_pred=knn4.predict(Xtest) 3 lin_pred_d1=sta.data1(lin_pred) 4 knn4_pred_d1=sta.data1(knn4_pred) 5 features_qq=['ypred','ytest'] 6 concrete.qq_plot_models([lin_pred_d1,knn4_pred_d1],ytest_d1,model_names=['lin','knn4'],features=features_qq)
In Lines 1-2 we evaluate the predictions of the two models with respect to the test set. In Lines 3-6 we represent the q-q plot between the target test and the predictions of the two models.
It is noted that also for the chosen test set, knn4 outperforms linear regression in the left tail of the concrete strength.
Comments