1.Introduction
Sampling technique is used to estimate uncertainty inherent in population. As the type of probability distribution of population is unknown in most sociology and engineering problems, the expectation of random variable or the occurrence probability of subset can be directly calculated through sampling. Also, in a case where the type of probability distribution of population is known or properly assumable, it is possible to calculate the occurrence probability of subset based on the distribution parameters of population estimated through sampling.
It is often difficult to define the probability distribution of inputs entered in a system to define problems in sociology and engineering. Even if the inputs are fortunately known as normal distribution, the outputs produced by a system may have other types of distribution than normal distribution if the correlation between random variables is nonlinear or the system itself has a nonlinear feature. Nevertheless, in probabilistic design for nuclear power plants associated to ASCE 43 [2], the probability distribution of outputs produced by the system is properly assumed in order to increase the efficiency in calculation through an allowance for some degree of error [1].
To estimate the mentioned uncertainties, one can use well known techniques such as 1) Random Sampling [3, 4], 2) Importance Sampling [5], and 3) Latin Hypercube Sampling [6]. As Random Sampling (RS) uses samples selected randomly among population, it has a disadvantage of a lengthy calculation time because a large number of samples are required especially in a case where a probability to be calculated is located in the tail of probability distribution. When there is a need to calculate the probability in the tail, Importance Sampling is a method for selecting, randomly in the tail area, a far smaller sample than RS. Latin Hypercube Sampling (LHS) is a method for dividing the whole interval of random variables into several intervals with the same probability area and for selecting samples from each interval. LHS can estimate the probability distribution approximately using a far smaller sample than RS, and is effective in estimating the distribution parameters in a case where the type of the probability distribution of random variable is known or properly assumable. For this reason, LHS is preferred in probabilistic structural design for an efficient calculation [1].
This study proposes extension of LHS to avoid the necessity of using intervals with the same probability area. In a case where intervals with different probability areas are used, it is necessary to make a correction using proper weight function. This method is called Weighted Latin Hypercube Sampling (WLHS). This study establishes equations necessary to apply weight function to WLHS, and verifies WLHS through numerical examples.
The proposed WLHS is also applied to seismically isolated structures in nuclear power plants. Huang et al. [1] considered the uncertainties using LHS for a study on seismic performance evaluation of seismically isolated structures in NPPs. The same numerical example as Huang et al. [1] is used in this paper with the site-specific ground motions of the nuclear power plant in Diablo Canyon, which is western region in the USA. In this application, clearance-to-stops (CSs) calculated using LHS proposed by Huang et al. [1] and WLHS proposed in this paper, respectively, are compared to investigate the effect of choosing different sampling techniques.
2.Weighted Latin Hypercube Sampling (WLHS)
In order to see how to extend LHS to propose WLHS, it is helpful for us to review on LHS. Based on this review, weight function and distribution parameters can be defined in a reasonable way.
2.1.Latin Hypercube Sampling (LHS)
If the number of samples is the same as that of population, an exact probability distribution can be found. However, in real situation, probability distribution can be estimated using samples far smaller than population in quantity. In Latin Hypercube Sampling, the total probability area created between probability density function and horizontal axis (random variable axis) is divided into n intervals with the same areas, and n samples are created through a random selection within each interval or a selection of the probability-area center in each interval. The probability- area center refers to the value of random variables at the center of probability area in each interval.
If the population of random variable X follows normal distribution, sample mean and unbiased sample variance can be estimated as follows, based on samples selected using LHS. The well-known derivation of these equation is shown in Appendix as a reference.
Using sample mean and sample variance calculated by these equations, the mean and variance of population can be approximately estimated. Furthermore, using these calculated distribution parameters, the normal probability distribution of population can be approximated.
If Y has log-normal distribution, ‘X=lnY’ follows normal distribution, and , which is the sample mean of X, is defined as follows:(3)
where Yi refers to a sample selected from the ith interval of log-normal distribution, and θ is the sample median of the log-normal distribution. The Equations above can be rewritten for θ as follows:(4)
The sample lognormal variance of Y, , is the same as the sample variance of lnY, so that it can be defined as follows:(5)
2.2.Weighted Latin Hypercube Sampling (WLHS)
In the afore-mentioned Latin Hypercube Sampling, each interval is set so that probability area of each interval is the same, and one sample is selected from each interval. However, this method of setting the sample intervals need to have more flexible way because of the following two reasons. First, sample data may not exist in an interval set by LHS in a case where the samples are collected from the un-controlled observations. Second, there is possibility that a sample interval can be optimized to accurately estimate the probability distribution of population using a smaller number of samples than LHS.
2.2.1.Weight function for a single random variable
In WHLS, the total probability area created by probability density function and horizontal axis (random variable axis) is divided into n intervals, and samples are selected from each interval as similarly as in LHS. These intervals need not have the same probability area, unlike LHS. In LHS method, the total probability area is divided into n intervals with the same areas, and n samples are selected from these intervals. Therefore, the probability of selecting each sample is the same as . On the other hand, in WLHS method, the probability area in each interval is different from each other. Therefore, probabilities of selecting samples in these intervals are different from each other. As mentioned above, the probability area (Ai) of each interval is the same as probability of selecting a sample from that interval, so that it is used as a weight function. This sampling method is referred to as Weighted Latin Hypercube Sampling (WLHS). The weight defined this way is represented by Eq. (1).
where i refers to a natural number ranging from 1 to n, the number of sample. should be satisfied.
2.2.2.Weight function for two or more random variables
If the probability distribution of population is defined by two random variables X and Y, weight for a sample selected in the jth sample interval (a,b) of random variable X is defined as the probability area of the jth interval, and weight for a sample selected in the kth sample interval (c,d) of random variable Y is defined in the same way. The probability (or weight wi ) that the jth sample of X and the kth sample of Y can be selected at the same time is the same as probability volume created by joint probability distribution in joint sample region of X and Y. This is represented by Eq. (7).
where if two variables X and Y are mutually independent, the equation above is simplified as follows. Furthermore, in a case where random variables are more than two, weight can be defined in the same way.(8)
2.2.3.Sample mean and sample variance
In the case of using WLHS, the probability of selecting each sample Xi is the same as weight wi as mentioned above. Considering this, weighted sample mean and unbiased weighted sample variance can be defined as follows. The derivation of these equations is described in Appendix. It is assumed that random variable X has normal distribution in derivation for unbiased weighted sample variance.
If weight wi in the two equations above is , they are simplified to those equations defined in LHS. Eq. (10) is the same as the equation defined by Wiki [7], GNU Scientific Library [8] and Madansky [9] for unbiased weighted sample variance. On the other hand, the derived equation for weighted sample variance by Copper et al. [10] fails to meet the unbiased condition of sample variance. Equations failing to meet unbiased condition have been used for some programs and research results without recognition of such confusion [11].
Equations derived above for WLHS can be easily extended to log-normal distribution. Let Y has log-normal distribution. Then, ‘X=lnY’ has normal distribution, so that sample mean of X is defined as follows:(11)
where Yi is a sample selected from the ith sample interval of log-normal distribution, and θ is sample median of population with log-normal distribution and defined as follows:(12)
The sample lognormal variance of population, , is the same as the sample variance of lnY, . Therefore, is defined as follows:(13)
Sampling intervals in WLHS can be selected more flexibly than in LHS, in consideration of users’ conveniences or the limits of given data. In other words, the mean and variance of population with normal distribution (or lognormal distribution) can be approximated more flexibly using sample mean (or median) and sample variance (sample lognormal variance) calculated according to the equations presented in this section. Based on this calculation, tail values of the normal distribution of population can also be approximated using the approximated distribution parameters.
3.Numerical examples
How accurately WLHS can estimate the probability distribution of population is reviewed according to procedures such as 1) to assume population whose probability distribution is known, 2) to select samples by different sampling methods such as Random Sampling, Latin Hypercube Sampling and Weighted Latin Hypercube Sampling, and 3) to compare how accurately each sampling method can estimate the distribution parameters. Sampling intervals and weights are properly assumed in WLHS as described in the previous section.
3.1.Assumed probability distribution of population
It is assumed that population has standard normal distribution. In other words, the mean and standard deviation of population are assumed as 0 and 1, respectively. Figs. 1 and 2 show probability density function (PDF) and cumulative distribution function (CDF) of that probability distribution. Samples are selected from the population using Random Sampling, LHS and WLHS. Sample mean and standard deviation are calculated, and their accuracy is verified through their comparison with the known mean and standard deviation of population. Note that since it is shown in Section 2 how to estimate distribution parameters for an example having log-normal distribution, one can easily conduct analysis for it. However, the example having standard normal distribution is considered in this paper.
3.2.Weighted Latin Hypercube Sampling
Samples are selected from population using Random Sampling, Latin Hypercube Sampling and Weighted Latin Hypercube Sampling. In Random Sampling, samples are selected randomly from standard normal distribution. In Latin Hypercube Sampling and Weighted Latin Hypercube Sampling, sample intervals are set according to the procedures mentioned in Section 2, and samples are selected at the probability- area center in each interval. To select samples in WLHS, sample weight wi should be defined according to Section 2.2. As described in Eq. (6), the weight can be considered as the probability area of each sampling interval, Ai. Sampling interval can be easily defined by making the probability area of each interval of random variable equal to weight, wi . The weight wi can be defined depending on distribution of collected data, or empirical intuition of researchers.
In this example, weight wi is chosen to have the following definition.
where α is a positive real number, and normalization coefficient is defined to meet is 0, is established, so that the same samples as in LHS are selected. In this example, the results of WLHS produced with the change of α are compared with the results from RS and LHS. As each sampling interval can be defined using that weight, the center of probability area of sampling interval is selected as sample point according to the following detail procedures.
-
1) Weight wi is arranged in the order to be applied to desired locations from the left side to the right side of probability distribution. In this example, weight is selected to be symmetric with respect to the mean. Fig. 3 shows weight wi under the conditions of α=0.5 and the number of samples, n=10. It can be found that the weights of WLHS are larger in the middle and smaller in both tails than those of LHS.
-
2) The center of probability area of each sampling interval is located using cumulative distribution function. The center of probability area of the 1st sample interval is under the condition of cumulative distribution function , and that of the ith sample interval is under the condition of . As this example assumes that the population follows standard normal distribution, Fi is equivalent to standard normal cumulative distribution function Φ.
-
3) Sample value, Xi , at the center of probability area of sample interval is found using inverse cumulative distribution function. In other words, where F-1 refers to inverse cumulative distribution function. Since the population is assumed to follow standard normal distribution, F-1 is equal to Φ-1. Fig. 4 shows sample value and cumulative probability at the center of probability area of each sampling interval under the conditions of α=0.5 and the number of samples, n=10. Fig. 5 shows the cumulative probability of samples over their sample numbers. This figure shows that, although samples in LHS are separated by the same intervals in probability, the intervals in WLHS are not same each other and relatively longer in the center and shorter in both tails.
3.3.Estimation of distribution parameters
For the sake of convenient calculation in the case of many actual probability and statistics problems, probability distributions are frequently estimated on the assumption that population has normal distribution. This example assumes that population follows standard normal distribution. If the mean and variance of population can be approximately estimated using sample mean and sample variance, the probability distribution of population can be easily approximated by these sample distribution parameters.
Since all samples in LHS have the same weights, the sample mean and sample variance are estimated according to Eqs. (1) and (2) in Section 2.1, respectively. Likewise, since all samples in RS have the same weights, the same equations as in LHS are used. On the other hand, since samples in WLHS have different weights, the sample mean and sample variance are estimated according to Eqs. (9) and (10) in Section 2.2, respectively.
Figs. 6 and 7 show the sample mean and sample standard deviation that vary according to the number of samples under the condition of α=0.5. The sample mean under LHS and WLHS is exactly the same as population mean. The reason is that LHS and WLHS selected samples symmetrically in this example so that the sample mean could be maintained to be exact to the population mean. In the cases of LHS and WHLS, mean and standard deviation of population are estimated quite efficiently and accurately comparing to RS. As shown in these figures, large enough number of samples in RS are necessary to approximate the mean and standard deviation of population.
Figs. 8 and 9 show sample mean and sample standard deviation that vary according to the number of samples under the condition of α=1.0 respectively. Fig. 8 shows that the sample means under LHS and WLHS are exactly the same as population mean due to the above- mentioned reason. Fig. 9 shows that sample standard deviation under WLHS esti- mates population standard deviation less accurately compared to LHS. The reason is that, as samples are concentrated on both tails with the increase of α, sample standard deviation is overestimated compared with population standard deviation. This trend become more extreme with the increase of α. In the case of using WLHS, samples are relatively concentrated on both tails, and thus sample standard deviation can be estimated at larger than population standard deviation. It is necessary to conduct a study in future on how samples can be properly distributed considering this effect.
3.4.Asymmetric sampling
To estimate the distribution parameter of population in this example, samples were selected symmetrically with respect to the mean in the previous sections. It is rare that collected data given as samples are distributed symmetrically with respect to the mean. Therefore, it is interesting to see what results any asymmetric sampling will produce in the above-mentioned example. For asymmetric sampling, weights will be defined by Eq. (14) and with a sequential increase from the left tail in index number i. Fig. 10 shows weight ωi selected asymmetrically under the condition of α=1.0 and n=10. As a point moves from the left to the right on the graph, weight becomes lower. This means that samples are selected more densely in the right tail.
Figs. 11 and 12 respectively show sample mean and sample standard deviation that vary according to the number of samples under the condition of α=0.5. Fig. 11 shows that sample mean of LHS is exactly the same as population mean, but that sample mean of WLHS estimates population mean with a little error. As WLHS selects sample asymmetrically, sample mean is biased toward where samples are more concentrated. As shown in Fig. 12, LHS estimates population standard deviation more accurately than WLHS.
Figs. 13 and 14 respectively show sample mean and sample standard deviation that vary according to the number of samples under the condition of α=1.0. Fig. 13 shows that sample mean of LHS is exactly the same as population mean, but that sample mean of WLHS estimates population mean with a little error. This is attributed to the same reason as that of Fig. 11. In Fig. 14, LHS estimates population standard deviation more accurately than WLHS. Therefore, it is necessary to study in future on how to distribute samples effectively in WLHS in order to minimize the difference between the sample standard deviation and the population standard deviation.
4.Application to seismically isolated structures in nuclear power plants
Huang et al. [1] considered the uncertainties using LHS for a study on seismic performance evaluation of seismically isolated structures in NPPs. The same numerical example as Huang et al. [1] is used in this paper with the site-specific ground motions of the nuclear power plant in Diablo Canyon, which is western region in the USA. In this numerical example, clearance-to-stops (CSs) calculated using LHS proposed by Huang et al. [1] and WLHS, respectively, are compared to investigate the effect of using different sampling techniques.
4.1.Lumped-mass model for a seismically isolated structure
Huang et al. [1] represented behaviours of seismically isolated structures by a lumped-mass model as shown in Fig. 15, to which seismic isolation devices such as friction-pendulum bearing, low-damping rubber bearing, and lead-rubber bearing were applied. In this paper, among these three isolation devices, a lead-rubber bearing is selected to implement for the lumped-mass model. The lumped-mass model represents an upper structure as a lumped mass mb, which has two horizontal degrees of freedom and one vertical degree of freedom as shown in Fig. 15. The spring and damper are introduced to express stiffness and damping effect in each degree of freedom. The restoring force of nonlinear spring in two horizontal directions is expressed as a bilinear function , and the linear spring stiffness in vertical direction is expressed as constant kυ . The restoring force of nonlinear spring in horizontal direction can be represented by a bilinear function, which is defined with secondary horizontal stiffness k2, yield strength fy, and yield displacement uy as shown in Fig. 16. Yield strength can be alternatively expressed by the y-intercept Qd, and primary horizontal stiffness k1 can be calculated as dividing yield strength fy by yield displacement uy . The present paper employs OpenSEES (2016) to perform nonlinear response history analyses (RHA) using this lumped- mass model.
For known horizontal yield displacement uy , y-intercept Qd and horizontal natural period Td, one can easily determine the first and secondary horizontal stiffness k1 and k2 , respectively, and the yield strength fy . For known Tυ, one can easily determine the vertical stiffness kυ . In the seismic isolation system, we assume that the horizontal yield displacement is 25 mm and the vertical natural period Ty is 0.05 sec. Therefore, Qd and Td are considered as random variables in this numerical example. Nine isolation systems are considered by setting three cases of Td such as 2, 3, and 4 sec, and three cases of Qd such as 0.03 W, 0.06 W, and 0.09 W. Among these, only two examples, representing the smallest and the largest differences between CSs calculated by LHS and WLHS, are selected to be shown.
A probability distribution of properties of seismic isolation systems occur during manufacturing process and quality control for seismic isolation units. Scale factors are defined to represent this distribution as these are multiplied with the mean values of properties to generate samples. The mean value of the scale factor is one, and the standard deviation are defined according to cases where quality control is excellent or good [1]. If the properties under excellent quality control have 95% confidence within ±10% interval from the best-estimated value, it has a normal distribution with 0.05 of standard deviation. If the properties under good quality control have 95% confidence within ±20% interval from the best-estimated value, it has a normal distribution with 0.1 of standard deviation. Table 1 show 30 samples of the scale factor extracted using LHS and WLHS for cases whose quality control is excellent and good. In these numerical examples, we consider three properties, k2 , Qd and kυ, as random variables under excellent quality control, and α in Eq. (14) is assumed to be 0.5 for determining sample weights in WLHS.
4.2.Input ground motion
In this example, 30 sets of input ground motions are selected as same as Huang et al. [1], and matched with the site-specific design spectrum of nuclear power plants in Diablo Canyon, western region in the USA. The matched input ground motions are considered as Design Basis Earthquakes (DBE). A single set of the input ground motions consists of two horizontal ground motions and one vertical ground motion.
It is reasonable to make intensity of input ground motions different in each direction if input ground motions in both directions are produced from one Uniform Hazard Response Spectrum (UHRS). Huang et al. [1] introduced the ratio, Fh , of input ground motion in horizontal direction and the ratio, Fυ , in vertical direction to the geometric means. These ratios are considered as random variables, and their distributions are defined based on statistics of measured earthquakes. For Fh , the median value, θh, is 1.3 and the logarithmic standard deviation, βh, is 0.13. For Fυ , the median value, θυ, is 1.0 and the logarithmic standard deviation, βυ, is 0.18. To maintain the same geometric mean in horizontal directions, the strong-axis ground motion is defined as Fh times of the geometric mean while the weak-axis ground motion in the orthogonal direction times. Thirty samples of Fh (Table 2) are extracted using LHS and WLHS (α in Eq. (14) is assumed to be 0.5 for determining sample weights), respectively, and then the maximumminimum spectra compatible ground motions are defined by multiplying the samples of Fh or with 30 sets of the selected ground motions. Fig. 17(a), (b) show the distributed maximum-minimum spectra compatible ground motions of the design basis earthquake in horizontal directions determined by WLHS. The similar figures determined by LHS can be found in Huang et al. [1].
In the same way, thirty samples of Fυ (Table 2) are extracted using LHS and WLHS, respectively, and then the samples of ground motions are determined by multiplying the samples of Fυ with 30 sets of the selected ground motions. Fig. 17(c) show the distributed ground motions of the design basis earthquake in vertical direction determined by WLHS. The similar figures determined by LHS also can be found in Huang et al. [1].
4.3.Effect of WLHS on CS
The effect of sampling techniques on CS is discussed in this section. Since CS is defined as 90%-ile displacement under 150% DBE according to ASCE 4 draft [12], the DBE shown in Fig. 17 are amplified by 150% and then applied as input ground motions to the lumped-mass model in Fig. 15(b). Among the 9 cases of isolation systems mentioned in Section 4.1, only two examples, representing the smallest and the largest differences between CSs calculated by LHS and WLHS, are selected to be shown in this section. The CDF of displacements based on the isolation system with Td 2 sec and Qd 0.03W and isolation system with Td 3 sec, Qd 0.03W are shown in Figs. 18 and 19, respectively. Figs. 18(a) and 19(a) compare the CDF of sample displacements and the corresponding estimated CDF calculated by LHS and WLHS. Figs. 18(b) and 19(b) show the enlarged view of the right tails in the CDFs. Here, the blue points indicate samples acquired through LHS and the red points indicates samples through WLHS. The estimated CDF is shown with dotted lines by calculating sample medians and sample logarithmic standard deviations.
According to Fig. 18(b), the calculated CSs (90%-ile displacements based on the estimated CDFs) by LHS and WLHS are 1080 mm and 1076 mm, respectively, so that the choice of sampling techniques essentially does not make any difference. However, the choice of the sampling techniques makes 16% difference in CS since the calculated CSs by LHS and WLHS are 1589 mm and 1549 mm, respectively, as shown in Fig. 19(b). These results indicate that a choice of sampling techniques can cause significant difference in CS. Therefore, it is necessary to standardize the sampling technique for calculating CS.
Figs. 18(a) and 19(a) show that there are significant differences between the samples and the estimated CDFs. This occurs because the assumed lognormal distribution is different from the actual distribution of population, and/or because there is randomness in sampling procedure. This difference can be reduced by extended case studies and indepth reviews on the sampling procedure.
5.Conclusion
This study presented Weighted Latin Hypercube Sampling which is expansion of Latin Hypercube Sampling by applying weights to samples. WLHS allows more flexible way of selecting samples. The results are summarized as follows:
-
1) The WLHS, which is an extension of Latin Hypercube Sampling (LHS) to avoid the necessity of using sampling intervals having the same probability area, has been proposed. WLHS provides more flexible way on selecting samples by using sampling intervals having different probability areas.
-
2) To approximate distribution parameters of population, the equations and procedures for calculating sample mean and sample variance of normal distribution (or sample median and sample lognormal variance of log-normal distribution) were clearly described with consideration of applying sample weights.
-
3) Accuracy of WLHS estimation on distribution parameters is depending on the selection of weight function. By comparing distribution parameters estimated by WLHS with the results of RS and LHS through numerical examples, it is found that distribution parameters can be estimated quite accurately using proper weights.
-
4) In WHLS where samples are distributed symmetrically with respect to mean, population mean is found to be accurately estimated by sample mean. On the other hand, in a case where they are asymmetrically distributed, the accuracy of an estimation of population mean is found to be relatively low.
-
5) In WHLS where samples are distributed symmetrically with respect to mean, samples are relatively concentrated on both tails, so that sample standard deviation is overestimated compared to population standard deviation.
-
6) In WLHS where samples are distributed asymmetrically, LHS is found to estimate population mean and population standard deviation more accurately than WLHS.
-
7) There is a need for a further study on how to distribute samples properly in WLHS to estimate distribution parameters with sufficient accuracy.
-
8) A choice of sampling techniques between LHS and WLHS can cause significant difference in determining CS. Since it would be better to minimize this randomness caused by choice of sampling techniques, reviews on the sources of this differences should be performed in future study.
-
9) There are significant differences between the samples and the estimated CDFs. This occurs because the assumed lognormal distribution is different from the actual distribution of population, and/or because there is randomness in sampling procedure. This difference can be reduced by extended case studies and in-depth reviews on the sampling procedure.
-
10) Although ASCE 43 [2] requires the probabilistic design technique in NPP, the sampling method in the probabilistic design technique is not clearly mentioned. The sampling method proposed in the present study can be one alternative. In order to guarantee consistent design regardless of choice of sampling techniques, it is necessary to establish a consistent sampling technique based on numerous examples in future.