We have a dataset of 164 measurements of timber strength.

The data show some variability. This is expected because timber is a naturally variable material.

The focus here is Exploratory Data Analysis (EDA) of the dataset:

statistics

histograms

cdf

boxplot

IMPORT MODULES

First, we import the main libraries that we will use.

#------------------------

#Import modules

#------------------------

1 import numpy as np
2 from OpenAIUQ import Stat as sta

In Line 1 we import numpy. In Line 2 we import the module Stat, whose prefix is sta.

LOAD THE DATASET

#--------------------------------------------------------------

#load the dataset (164 data of timber strength)

#--------------------------------------------------------------

1 dataset=np.loadtxt('timber164.txt')
2 #timber is the object data1 collecting the dataset
3 timber=sta.data1(dataset)
4 data=timber.data
5 timber.disp_summary(content='reduced')

In Line 1 we load the dataset as an ndarray of shape (164,). In Line 3 we associate the object timber (type=sta.data1) with the dataset. In Line 5 we use the method disp_summary of timber to check the main statistical properties of the sample.

The data range from min=17.98 MPa to max=70.22 MPa, so a suitable range for the analysis is xx=[0, 80] MPa. The sample mean is m=39.32 MPa, and the sample coefficient of variation v=24.02% confirms the high degree of variability. The skewness g1=0.53 shows that the right tail is dominant. The sample kurtosis g2=3.61 (greater than the Gaussian value of 3) shows that the tails are heavy. Taken together, the pair (g1,g2) implies that high values of timber strength are not unlikely.
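To make the meaning of these summary quantities concrete, the following is a minimal sketch that uses only the Python standard library. It is not the implementation behind disp_summary. The function name sample_summary and the demo values are hypothetical. The estimators are assumptions: moment-based skewness and kurtosis with an n denominator, and a standard deviation with an n-1 denominator.

```python
import math

def sample_summary(x):
    """Return basic descriptive statistics of a sequence of numbers."""
    n = len(x)
    mean = sum(x) / n
    # Central moments of order 2, 3 and 4 (divided by n: an assumed convention)
    m2 = sum((xi - mean) ** 2 for xi in x) / n
    m3 = sum((xi - mean) ** 3 for xi in x) / n
    m4 = sum((xi - mean) ** 4 for xi in x) / n
    std = math.sqrt(m2 * n / (n - 1))  # sample standard deviation (n-1 denominator)
    return {
        "min": min(x),
        "max": max(x),
        "mean": mean,
        "cov": std / mean,     # coefficient of variation
        "g1": m3 / m2 ** 1.5,  # skewness: 0 for a symmetric sample
        "g2": m4 / m2 ** 2,    # kurtosis: 3 for a Gaussian population
    }

# Made-up values for illustration only, NOT the timber dataset
demo = [22.0, 31.5, 35.0, 38.2, 40.1, 44.7, 52.3, 68.9]
summary = sample_summary(demo)
for name, value in summary.items():
    print(f"{name:>4s} = {value:.3f}")
```

With this convention a positive g1 indicates a longer right tail, and a g2 above 3 indicates tails heavier than Gaussian, which is how the values reported for the timber sample are read.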

It is also possible to obtain a fuller description of the features of the dataset

1 timber.disp_summary(content='full')

by choosing the option content='full'.

For a visual representation of the dataset, we plot the histograms

HISTOGRAMS

#--------------------------------------------------------------

#plot histograms

#--------------------------------------------------------------

1 timber.plot_hist()

In Line 1 we plot the histogram with default values. The two main parameters of the method plot_hist of data1 are "stat" and "bins". Their default values are stat='count' and bins='auto'. The value stat='count' means that we represent the number of occurrences in each class. Other choices are stat='probability', which represents the relative frequencies, and stat='density', which normalizes the histogram to a Probability Density Function (pdf). By default the binning is automatic. To make a more informed binning choice we write the following code:

1 edges=np.arange(0,85,5)
2 timber.plot_hist(bins=edges,stat='probability')
3 #timber.fig is the figure
4 #timber.ax is the subplot
5 timber.ax.set_xlim(0,80)
6 timber.ax.set_xticks(edges)
7 timber.ax.set_xlabel(r'$f_{timber} \ MPa$')
8 timber.ax.set_title('Timber strength n=164')
9 fig_hist5=timber.fig

In Line 1 we define the array edges, ranging from f=0 MPa to f=80 MPa with a step delta_f=5 MPa. It collects the edges of the bins. In Line 2 we plot the histogram again, this time with the chosen bins and in terms of relative frequency. The method plot_hist of data1 stores the figure and the subplot in the attributes fig and ax. We use the attribute timber.ax to define the limits of the x-axis (Line 5), the ticks of the x-axis (Line 6), the label of the x-axis (Line 7) and the title of the figure (Line 8). The label is written as a raw string (prefix r) so that the backslash is passed unchanged to the math renderer. In Line 9 we store the figure timber.fig as fig_hist5, the histogram with a bin width of 5 MPa.
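The three values of stat differ only in how the bin counts are normalized. The sketch below illustrates this with the standard library alone. The helper histogram and the demo values are hypothetical, and it is only an illustration of the normalizations, not the code behind plot_hist.

```python
from bisect import bisect_right

def histogram(data, edges, stat="count"):
    """Bin `data` into the classes delimited by `edges`, normalized per `stat`."""
    counts = [0] * (len(edges) - 1)
    for x in data:
        if edges[0] <= x <= edges[-1]:
            # Class i is [edges[i], edges[i+1]); the last class also includes its right edge
            i = min(bisect_right(edges, x) - 1, len(counts) - 1)
            counts[i] += 1
    n = sum(counts)  # number of values that fall inside the binned range
    if stat == "count":
        return counts
    if stat == "probability":
        return [c / n for c in counts]
    if stat == "density":
        return [c / (n * (edges[i + 1] - edges[i])) for i, c in enumerate(counts)]
    raise ValueError(f"unknown stat: {stat!r}")

edges = list(range(0, 85, 5))  # 0, 5, ..., 80: the stop value 85 is excluded
demo = [22.0, 31.5, 35.0, 38.2, 40.1, 44.7, 52.3, 68.9]  # made-up values, NOT the timber data

counts = histogram(demo, edges, "count")
prob = histogram(demo, edges, "probability")
dens = histogram(demo, edges, "density")

print(sum(counts))               # number of binned values
print(sum(prob))                 # relative frequencies add up to 1
print(sum(d * 5 for d in dens))  # density times bin width adds up to 1
```

The relative frequencies depend on the bin width, while the density does not scale with it, which is why stat='density' is the natural choice when comparing a histogram with a pdf.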

This binning shows a lack of data in the class [60,65] MPa, so a larger bin width is also considered:

1 edges=np.arange(0,88,8)
2 timber.plot_hist(bins=edges,stat='probability')
3 timber.ax.set_xlim(0,80)
4 timber.ax.set_xticks(edges)
5 timber.ax.set_xlabel(r'$f_{timber} \ MPa$')
6 timber.ax.set_title('Timber strength n=164')
7 fig_hist8=timber.fig

The code is identical except for the bin width (Line 1), which is now delta_f=8 MPa. The two histograms with different bin widths convey the same qualitative description of the data: the mode is in the class [36,40] MPa, the distribution is asymmetric toward the right tail, and high values of timber strength (corresponding to quantiles p>95%) are not unlikely. The histograms are consistent with the sample mean, standard deviation, skewness and kurtosis.

A more robust description of the dataset (one that does not depend on any parameter, such as the bin width) is given by the empirical cumulative distribution function (ecdf), provided by the method plot_cdf_emp.

#--------------------------------------------------------------

#plot empirical cdf

#--------------------------------------------------------------

1 timber.plot_cdf_emp()
2 #timber.fig is the figure
3 #timber.ax is the subplot
4 timber.ax.set_xlim(0,80)
5 timber.ax.set_ylim(0,1)
6 timber.ax.set_xlabel(r'$f_{timber} \ MPa$')
7 timber.ax.set_ylabel('ecdf $f_{timber}$')
8 timber.ax.set_title('Timber strength n=164')
9 fig_ecdf=timber.fig

In Line 1 we plot the ecdf, while in Lines 4-8 we edit the figure using the attribute ax of the object timber. In Line 9 we store the figure as fig_ecdf. Since we are analyzing a material strength, it is of interest to evaluate the characteristic value, which is the 5% quantile

#--------------------------------------------------------------

#Characteristic value

#--------------------------------------------------------------

1 timber.get_quantile(0.05)
2 print('f_k= ',timber.Q,' MPa')

In Line 1 we evaluate the 5% quantile of the dataset through the method get_quantile(p) of sta.data1. Its value is stored in the attribute Q of sta.data1. In Line 2 we print the characteristic value, stored in timber.Q.
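As an illustration of what the empirical cdf and a sample quantile compute, here is a short standard-library sketch. The functions ecdf and quantile and the demo values are hypothetical. The interpolation rule is an assumption (linear interpolation between order statistics), and get_quantile may use a different convention, so values can differ slightly.

```python
def ecdf(data, x):
    """Empirical cdf: fraction of observations less than or equal to x."""
    return sum(1 for v in data if v <= x) / len(data)

def quantile(data, p):
    """p-quantile by linear interpolation between order statistics (assumed rule)."""
    s = sorted(data)
    pos = (len(s) - 1) * p        # fractional position in the sorted sample
    lo = int(pos)                 # index of the order statistic at or below pos
    hi = min(lo + 1, len(s) - 1)  # next order statistic, clipped at the maximum
    return s[lo] + (pos - lo) * (s[hi] - s[lo])

# Made-up values for illustration only, NOT the timber dataset
demo = [10.0, 20.0, 30.0, 40.0, 50.0]

f_k = quantile(demo, 0.05)  # characteristic value of the demo sample
print("f_k =", f_k, "MPa")
print("ecdf at 30 MPa =", ecdf(demo, 30.0))
```

The quantile is the approximate inverse of the ecdf: the ecdf maps a strength to a fraction of the sample, and the quantile maps a fraction back to a strength.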

Another commonly used representation is the boxplot, provided by the method plot_box of sta.data1

#--------------------------------------------------------------

#plot boxplot

#--------------------------------------------------------------

1 timber.plot_box()
2 #timber.fig is the figure
3 #timber.ax is the subplot
4 timber.ax.set_ylabel('$f_{timber}$')
5 fig_boxplot=timber.fig

The lower and upper edges of the box are the lower quartile Q1 (p=25%) (stored in timber.Q1) and the upper quartile Q3 (p=75%) (stored in timber.Q3). The central line of the box is the median Q2 (p=50%) (stored in timber.Q2). The boxplot also shows the limits min=Q1-1.5*(Q3-Q1) and max=Q3+1.5*(Q3-Q1), where Q3-Q1 is the interquartile range. Values beyond these limits are considered outliers. In Line 4 we edit the label of the y-axis of the figure using the attribute ax of the object timber.
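The quantities behind a boxplot can be checked by hand. The sketch below uses the standard-library function statistics.quantiles. The helper box_stats and the demo values are hypothetical, and the quartile rule (the "inclusive" linear interpolation) is an assumption that may not match the one used by plot_box.

```python
import statistics

def box_stats(data, whis=1.5):
    """Quartiles, outlier limits and outliers following the 1.5*IQR rule."""
    # Assumption: quartiles by linear interpolation over the sample ("inclusive" method)
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1             # interquartile range
    lower = q1 - whis * iqr   # values below this limit are flagged as outliers
    upper = q3 + whis * iqr   # values above this limit are flagged as outliers
    outliers = [v for v in data if v < lower or v > upper]
    return {"Q1": q1, "Q2": q2, "Q3": q3,
            "lower": lower, "upper": upper, "outliers": outliers}

# Made-up values for illustration only, NOT the timber dataset
demo = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0]
stats = box_stats(demo)
for name, value in stats.items():
    print(name, "=", value)
```

In this demo sample only the value 100.0 lies beyond the upper limit, so it is the only point reported as an outlier.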