Open Journal of Mathematical Sciences
ISSN: 2523-0212 (Online) 2616-4906 (Print)
DOI: 10.30538/oms2021.0156
Joint influence of measurement errors and randomized response technique on mean estimation under stratified double sampling
Ronald Onyango\(^1\), Brian Oduor, Francis Odundo
Department of Applied Statistics, Financial Mathematics and Actuarial Science, Jaramogi Oginga Odinga University of Science and Technology, P.O. Box 210, Bondo, Kenya (R.O., B.O. and F.O.)
\(^{1}\)Corresponding Author: assangaronald@gmail.com
Abstract
Keywords:
1. Introduction
Auxiliary variables are closely related to the survey variable and are used at the design and estimation stages of a survey to improve the efficiency of estimators of the finite population mean. The difference between the true value of a variable and the value recorded in a survey is referred to as measurement error. Measurement errors are caused by memory loss, prestige bias, over-reporting, under-reporting, processing errors, and incorrect values supplied by the respondent. In the literature, most researchers assume that the data collected in a survey are error-free. This is rarely the case, however: measurement errors are inherent in survey sampling.
In a survey, the researcher may need to estimate the finite population mean for a sensitive question with a socially stigmatizing characteristic, such as "Have you ever had an abortion?", "Are you a drug addict?" and "Have you ever been infected with a sexually transmitted disease?". It is difficult to obtain correct responses to such questions in personal interviews that involve direct questioning, because the respondent's privacy is unprotected. Consequently, this may result in measurement errors. Warner [1] proposed the Randomized Response Technique (RRT), which aims to reduce answer bias in a survey involving a sensitive variable by protecting the privacy of the respondents. In the RRT, a scrambling variable that is independent of the survey and auxiliary variables is used in the estimation of the finite population mean of a sensitive variable. The respondent is expected to provide a true response for the non-sensitive auxiliary variable and a scrambled response for the survey variable. The scrambled response is obtained by adding a random number to the true response to a sensitive question. The value added is unknown to the survey practitioner, but the probability distribution of the scrambling variable is assumed to be known.
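The additive scrambling idea described above can be illustrated with a minimal sketch. It assumes a zero-mean normal scrambling variable, and the population values, sample size, standard deviations and seed are all hypothetical choices made for illustration, not values taken from the study.

```python
import random
import statistics

def scramble(true_values, sd_s, rng):
    """Additive RRT: each respondent reports y + s, with s drawn from N(0, sd_s^2)."""
    return [y + rng.gauss(0.0, sd_s) for y in true_values]

rng = random.Random(2021)  # arbitrary seed, for reproducibility only

# Hypothetical true values of a sensitive quantitative variable.
y = [rng.gauss(50.0, 10.0) for _ in range(5000)]

# What the interviewer actually records: the scrambled responses.
z = scramble(y, sd_s=2.0, rng=rng)

mean_y = statistics.fmean(y)
mean_z = statistics.fmean(z)
print(f"mean of true values:      {mean_y:.3f}")
print(f"mean of scrambled values: {mean_z:.3f}")
```

Because the scrambling variable has mean zero, the mean of the scrambled responses stays close to the mean of the true values, even though no individual true value is disclosed.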
The problem of estimation of the finite population mean for a non-sensitive variable using an auxiliary variable under simple random sampling is addressed by Shalabh [2], Diwakar et al. [3] and Yadav et al. [4]. Additionally, Gajendra et al. [5] used calibrated weights to propose ratio and regression type mean estimators for a non-sensitive variable under stratified random sampling.
The problem of estimation of the finite population mean for a sensitive variable based on the Randomized Response Technique (RRT) under different sampling schemes is addressed by Eichhorn and Hayre [6], Gupta and Shabbir [7], Gupta et al. [8], Sousa et al. [9] and Tanveer and Housila [10].
Mushtaq et al. [11] and Mushtaq et al. [12] have proposed different estimators of the finite population mean for a sensitive variable using a non-sensitive auxiliary variable under stratified random sampling. The problem of estimation of the finite population mean under stratified two-phase sampling is discussed by Mushtaq et al. [12]. The joint influence of double sampling and the Randomized Response Technique (RRT) on the estimation of the finite population mean under simple random sampling is addressed by Mushtaq and Noor-Ul-Amin [13]. Additionally, the problem of estimation of the finite population mean for a sensitive variable in the presence of non-response based on the Randomized Response Technique (RRT) is discussed by Naeem and Shabbir [14]. Zahid and Shabbir [15] proposed a generalized class of estimators of the finite population mean using a non-sensitive auxiliary variable in the presence of non-response and measurement errors under simple random sampling and stratified random sampling.
Sadia [16] proposed generalized estimators of the finite population mean in the presence of measurement errors under simple random sampling and stratified random sampling. The performances of the proposed estimators were studied in the presence and absence of the measurement errors. Recently, Zhang [17] addressed the problem of mean estimation for a sensitive variable based on optional Randomized Response Technique (RRT) in the presence of non-response and measurement errors under simple random sampling and stratified random sampling.
Handling sensitive survey questions and measurement errors is a major challenge for survey practitioners especially when both occur simultaneously in a survey. The present study fills the existing gap in the literature on mean estimation for a sensitive variable using a non-sensitive auxiliary variable in the presence of measurement errors under stratified double sampling. Also, the combined effect of measurement errors and Randomized Response Technique (RRT) on estimators of the finite population mean is investigated.
The study considers an additive Randomized Response Technique (RRT) model in which the respondent adds a random number to the true answer of a sensitive question to give a scrambled response. Further, the probability distribution of the scrambling variable is assumed to be known by the survey practitioner. The proposed strategy assumes that measurement errors are present in both first and second-phase samples of stratified double sampling.
In the present paper, Section 2 gives a detailed description of the population under study. The ordinary mean estimator of the finite population mean for a sensitive variable is discussed in Section 3. Section 4 describes the properties of the proposed estimator of the finite population mean for a sensitive variable using a non-sensitive auxiliary variable in the presence of measurement error. In Section 5, members of the family of the proposed generalized estimator are discussed. The efficiency of the proposed estimator is studied theoretically in Section 6. Finally, a numerical analysis of the performance of the proposed estimator is done in Section 7.
2. Population description and notations
Consider a heterogeneous population \(U=\{1, 2, \dots , N\}\) of size \(N\) with a survey variable \(Y\) and an auxiliary variable \(X\). The population is partitioned into \(L\) homogeneous groups, known as strata, with the \(h^{th}\) stratum of size \(N_h\). In a survey, direct observations cannot be made on a sensitive variable with socially stigmatizing characteristics, hence the Randomized Response Technique (RRT) is used to obtain unbiased estimates of the finite population parameters. Let \(S\) be a scrambling variable that is normally distributed with mean 0 and variance \(S^2_{Sh}\). The respondent is expected to provide a true response for the auxiliary variable and a scrambled response for the sensitive variable. Let \(Z_{hi}=Y_{hi}+S_{hi}\) denote the \(i^{th}\) scrambled response in the \(h^{th}\) stratum, and let \(X_{hi}\) denote the \(i^{th}\) value of \(X\) in the \(h^{th}\) stratum. Additionally, let \({\overline{Z}}_h\) and \({\overline{X}}_h\) be the population means of \(Z\) and \(X\) respectively in the \(h^{th}\) stratum, and let \(S^2_{Zh}\) and \(S^2_{Xh}\) be the corresponding population variances. Let \(S_{ZXh}\) and \({\rho }_{ZXh}\) denote the covariance and the coefficient of correlation between \(Z\) and \(X\) in the \(h^{th}\) stratum.
In the presence of measurement errors, let \((x^*_{hi},{\ z}^*_{hi})\) and \((X^*_{hi},{\ Z}^*_{hi})\) be the observed and true values of \(X\) and \(Z\) respectively in the \(h^{th}\) stratum. Let \(T^*_{hi}=z^*_{hi}-Z^*_{hi}\) and \(V^*_{hi}=x^*_{hi}-X^*_{hi}\) denote the measurement errors associated with \(Z\) and \(X\) respectively in the \(h^{th}\) stratum. The measurement errors are assumed to be normally distributed with mean zero and variances \(S^2_{Th}\) and \(S^2_{Vh}\) for \(Z\) and \(X\) respectively in the \(h^{th}\) stratum.
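As a worked consequence of the additive model, the stratum moments of the scrambled and observed responses can be decomposed as sketched below. This is an illustration under assumptions that the text does not state in this form: the scrambling variable is taken to be independent of \(Y\), the measurement errors are taken to be independent of the true values, and \(S^2_{Yh}\) (the stratum variance of \(Y\)) is a symbol introduced here only for the illustration.

\begin{align*}
E(Z_{hi}) &= E(Y_{hi})+E(S_{hi}) = E(Y_{hi}),\\
S^2_{Zh} &\approx S^2_{Yh}+S^2_{Sh},\\
Var(z^*_{hi}) &\approx S^2_{Zh}+S^2_{Th}, \qquad Var(x^*_{hi}) \approx S^2_{Xh}+S^2_{Vh}.
\end{align*}

Under these assumptions scrambling leaves the mean unchanged, while both scrambling and measurement error inflate the variance of what is actually recorded.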
A relatively large first-phase sample of size \(n'\) is drawn from the population using simple random sampling without replacement (SRSWOR), and the units are classified into \(L\) homogeneous strata, with \(n'_h\) units falling in the \(h^{th}\) stratum. A second-phase random sample of size \(n_h\) is then drawn from the first-phase units of the \(h^{th}\) stratum using SRSWOR, and both the survey and auxiliary variables are studied. Let \({\overline{x}}'_h\) denote the first-phase \(h^{th}\) stratum sample mean for \(X.\) Further, let \({\overline{x}}_h\) and \({\overline{z}}_h\) denote the second-phase \(h^{th}\) stratum sample means for \(X\) and \(Z\) respectively.
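The two-phase design can be sketched in a few lines. The sketch below is only illustrative: the toy population, the seed, the rule that sets each second-phase size to a fixed fraction of the first-phase stratum size, and the weight \(n'_h/n'\) are all assumptions introduced here, not specifications taken from the paper.

```python
import random
from collections import defaultdict
from statistics import fmean

def stratified_double_sample(population, n_first, second_fraction, rng):
    """Two-phase design: SRSWOR first phase, stratify, then SRSWOR within strata.

    population: list of (stratum_label, x, z) tuples.
    Returns (first_phase_by_stratum, second_phase_by_stratum).
    """
    first_phase = rng.sample(population, n_first)            # SRSWOR of size n'
    by_stratum = defaultdict(list)
    for unit in first_phase:
        by_stratum[unit[0]].append(unit)                     # n'_h units land in stratum h
    second_phase = {}
    for h, units in by_stratum.items():
        n_h = max(1, round(second_fraction * len(units)))    # hypothetical allocation rule
        second_phase[h] = rng.sample(units, n_h)             # SRSWOR of size n_h out of n'_h
    return dict(by_stratum), second_phase

rng = random.Random(7)  # arbitrary seed
# Toy population; stratum sizes and means are made up for illustration.
population = []
for h, (size, mu) in enumerate([(100, 450.0), (250, 580.0), (300, 920.0)], start=1):
    for _ in range(size):
        population.append((h, rng.gauss(mu, 10.0), rng.gauss(mu / 2, 5.0)))

first, second = stratified_double_sample(population, n_first=200, second_fraction=0.4, rng=rng)
n_first_total = sum(len(units) for units in first.values())
for h in sorted(first):
    w_h = len(first[h]) / n_first_total                      # assumed stratum weight n'_h / n'
    x_bar_first = fmean(u[1] for u in first[h])              # first-phase mean of X
    x_bar_second = fmean(u[1] for u in second[h])            # second-phase mean of X
    z_bar_second = fmean(u[2] for u in second[h])            # second-phase mean of Z
    print(h, round(w_h, 3), round(x_bar_first, 2), round(x_bar_second, 2), round(z_bar_second, 2))
```

The first-phase stratum means use every unit classified into the stratum, while the second-phase means use only the subsample on which both variables are observed.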
3. Existing estimators in the literature
The ordinary mean estimator in the presence of measurement errors in stratified double sampling is defined as
4. Proposed estimator
Let \({\overline{x}}'_h=\frac{1}{n'_h}\sum^{n'_h}_{i=1}{x_{hi}}\) and \({\overline{x}}_h=\frac{1}{n_h}\sum^{n_h}_{i=1}{x_{hi}}\) denote the first and second-phase stratum sample means for the auxiliary variable respectively. Further, let \({\overline{z}}_h=\frac{1}{n_h}\sum^{n_h}_{i=1}{z_{hi}}\) denote the mean of the scrambled response in the second-phase stratum sample and \(w_h\) denote the \(h^{th}\) stratum weight. The proposed estimator of the finite population mean in the presence of measurement errors is given as
Substitute Equations (1)-(3) in (12) and solve using Taylor's approximation while ignoring terms of order greater than two, and then subtract the population mean to obtain
5. Members of the family of the proposed generalized estimator
Members of the family of the proposed estimator are obtained as follows:
- (i)   For \({\alpha }_h=\frac{1}{2}\), the proposed estimator reduces to the ratio estimator given as
\begin{equation} \label{GrindEQ__18_} t_r=\sum^L_{h=1}{w_h{\overline{z}}_h}\left(\frac{{\overline{x}}'_h}{{\overline{x}}_h}\right) \tag{18} \end{equation}
\begin{equation} \label{GrindEQ__19_} Bias\left(t_r\right)\cong \sum^L_{h=1}{\frac{W_h}{{\overline{X}}_h}}\left[\frac{9}{8}\ R_h\left(A_h-C_h\right)-\left(E_h-D_h\right)\right], \tag{19} \end{equation}
\begin{equation} \label{GrindEQ__20_} MSE(t_r)\cong \sum^L_{h=1}{W^2_h\left[B_h+R^2_h\left(A_h-C_h\right)-2R_h(E_h-D_h)\right]} \tag{20} \end{equation}
- (ii)   For \({\alpha }_h=1\), the proposed estimator reduces to the exponential ratio-type estimator given as
\begin{equation} \label{GrindEQ__21_} t_{err}=\sum^L_{h=1}{w_h{\overline{z}}_h}\left(\frac{{\overline{x}}'_h}{{\overline{x}}_h}\right)\exp\left(\frac{{\overline{x}}'_h-{\overline{x}}_h}{{\overline{x}}'_h +{\overline{x}}_h}\right) \tag{21} \end{equation}
\begin{equation} \label{GrindEQ__22_} Bias\left(t_{err}\right)\cong \sum^L_{h=1}{\frac{W_h}{{\overline{X}}_h}}\left[\frac{15}{8}\ R_h\left(A_h-C_h\right)-\frac{3}{2}\left(E_h-D_h\right)\right], \tag{22} \end{equation}
\begin{equation} \label{GrindEQ__23_} MSE(t_{err})\cong \sum^L_{h=1}{W^2_h\left[B_h+\frac{9}{4}R^2_h\left(A_h-C_h\right)-3R_h(E_h-D_h)\right]} \tag{23} \end{equation}
- (iii)   For \({\alpha }_h=0\), the proposed estimator reduces to the exponential ratio-product-type estimator given as
\begin{equation} \label{GrindEQ__24_} t_{erp}=\sum^L_{h=1}{w_h{\overline{z}}_h}\left(\frac{{\overline{x}}'_h}{{\overline{x}}_h}\right)\exp\left(\frac{{\overline{x}}_h-{\overline{x}}'_h}{{\overline{x}}_h+{\overline{x}}'_h}\right) \tag{24} \end{equation}
\begin{equation} \label{GrindEQ__25_} Bias\left(t_{erp}\right)\cong \sum^L_{h=1}{\frac{W_h}{{\overline{X}}_h}}\left[\frac{3}{8}\ R_h\left(A_h-C_h\right)-\frac{1}{2}\left(E_h-D_h\right)\right], \tag{25} \end{equation}
\begin{equation} \label{GrindEQ__26_} MSE(t_{erp})\cong \sum^L_{h=1}{W^2_h\left[B_h+\frac{1}{4}R^2_h\left(A_h-C_h\right)-R_h(E_h-D_h)\right]} \tag{26} \end{equation}
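The three members in Equations (18), (21) and (24) can be evaluated from stratum-level summaries alone. The following is a minimal sketch: the function name, the dictionary keys and the numbers in the usage example are hypothetical, and the stratum weights are simply passed in rather than derived.

```python
import math

def member_estimates(strata):
    """Evaluate t_r, t_err and t_erp (Equations (18), (21), (24)).

    strata: list of dicts with keys
        w            stratum weight w_h
        z_bar        second-phase mean of the scrambled response
        x_bar_first  first-phase mean of the auxiliary variable
        x_bar_second second-phase mean of the auxiliary variable
    """
    t_r = t_err = t_erp = 0.0
    for s in strata:
        ratio = s["x_bar_first"] / s["x_bar_second"]
        d = (s["x_bar_first"] - s["x_bar_second"]) / (s["x_bar_first"] + s["x_bar_second"])
        base = s["w"] * s["z_bar"] * ratio      # ratio-adjusted stratum contribution
        t_r += base                             # Equation (18)
        t_err += base * math.exp(d)             # Equation (21)
        t_erp += base * math.exp(-d)            # Equation (24): exponent has the opposite sign
    return t_r, t_err, t_erp

# Hypothetical summaries for two strata.
example = [
    {"w": 0.4, "z_bar": 10.0, "x_bar_first": 5.2, "x_bar_second": 5.0},
    {"w": 0.6, "z_bar": 20.0, "x_bar_first": 8.1, "x_bar_second": 8.0},
]
t_r, t_err, t_erp = member_estimates(example)
print(t_r, t_err, t_erp)
```

When the first-phase and second-phase means of the auxiliary variable coincide in every stratum, all three members collapse to the weighted mean of the scrambled responses, which is a convenient sanity check.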
6. Efficiency comparison
In this section, the performance of the proposed estimators is studied theoretically.
- i.   From Equations (11) and (17), \({MSE(t_g)}_{min}-Var\left(t_0\right)< 0\) if \[{(E_h-D_h)}^2 >0.\]
- ii.  From Equations (17) and (20), \({MSE(t_g)}_{min}-MSE\left(t_r\right)< 0\) if \[{(D_h-E_h)}^2+R^2_h{\left(A_h-C_h\right)}^2-2R_h\left(E_h-D_h\right)\left(A_h-C_h\right)>0.\]
- iii.   From Equations (17) and (23), \({MSE(t_g)}_{min}-MSE\left(t_{err}\right)< 0\) if \[{(D_h-E_h)}^2+\frac{9}{4}R^2_h{\left(A_h-C_h\right)}^2-3R_h\left(E_h-D_h\right)\left(A_h-C_h\right)>0.\]
- iv.   From Equations (17) and (26), \({MSE(t_g)}_{min}-MSE\left(t_{erp}\right)< 0\) if \[{(D_h-E_h)}^2+\frac{1}{4}R^2_h{\left(A_h-C_h\right)}^2-R_h\left(E_h-D_h\right)\left(A_h-C_h\right)>0.\]
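The structure of such comparisons can be seen by writing the special cases in a common form. The following is a sketch under assumptions rather than a restatement of the paper's derivation: the per-stratum MSE terms in Equations (20), (23) and (26) are read as one quadratic \(M_h(k)\) with \(k=1\), \(k=\frac{3}{2}\) and \(k=\frac{1}{2}\) respectively, the minimum MSE of the generalized estimator is taken to be the minimum of that quadratic over \(k\), and \(A_h-C_h>0\) is assumed. The symbols \(M_h\) and \(k\) are introduced here only for the illustration.

\begin{align*}
M_h(k) &= B_h+k^2R^2_h\left(A_h-C_h\right)-2kR_h\left(E_h-D_h\right),\\
\min_k M_h(k) &= B_h-\frac{{\left(E_h-D_h\right)}^2}{A_h-C_h}, \qquad k_{opt}=\frac{E_h-D_h}{R_h\left(A_h-C_h\right)},\\
M_h(k)-\min_k M_h(k) &= \frac{{\left[kR_h\left(A_h-C_h\right)-\left(E_h-D_h\right)\right]}^2}{A_h-C_h}\ \ge 0.
\end{align*}

Under these assumptions each difference is a perfect square divided by a positive quantity, so a member of the family can match the minimum only when its fixed \(k\) happens to equal \(k_{opt}\) in every stratum.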
7. Numerical study
7.1. Introduction
A numerical study is conducted using both simulated and real data sets to compare the performance of the proposed estimator with some existing estimators in the literature. The real data set is obtained from Sarndal et al. [18]. The simulated data are generated using the R programming language. The data sets consist of the survey variable \(Y\) and the auxiliary variable \(X.\) A normally distributed scrambling variable, \(S_{hi}\ \sim \ N\left(0,\ 2\right)\), is generated for each unit in the data set. Thereafter, the response variable is obtained as \(Z_{hi}=Y_{hi}+S_{hi}\). Finally, normally distributed measurement errors with mean 2 and standard deviation 5 are added to each unit of the response and auxiliary variables. The efficiency of the proposed estimator is compared with that of the other estimators using the minimum variance and the Percent Relative Efficiency (PRE) approaches. The PRE of an estimator \(t\) is obtained using the expression
\[PRE(t)=\frac{Var(t_0)}{MSE(t)}\times 100.\]
Population I: Simulated data
Stratum 1
\begin{align*}X_1&=rnorm(100,\ 450,\ 15),\\ x_1&=X_1+rnorm(100,\ 2,\ 5),\\ Y_1&=0.8+0.5X_1+rnorm(100,\ 0,\ 1),\\ Z_1&=Y_1+rnorm(100,\ 0,\ \ 0.2), \ \ \ \text{ and}\\ z_1&=Z_1+rnorm(100,\ 2,\ 5).\end{align*}
Stratum 2
\begin{align*} X_2&=rnorm(250,\ 50,\ 15),\\ x_2&=X_2+rnorm(250,\ 2,\ 5),\\ Y_2&=0.8+0.5X_2+rnorm(250,\ 0,\ 1),\\ Z_2&=Y_2+rnorm(250,\ 0,\ \ 0.2),\ \ \ \text{and}\\ z_2&=Z_2+rnorm(250,\ 2,\ 5).\end{align*}
Stratum 3
\begin{align*}X_3&=rnorm(300,\ 920,\ 25),\\ x_3&=X_3+rnorm(300,\ 2,\ 5),\\ Y_3&=0.8+0.5X_3+rnorm(300,\ 0,\ 1),\\ Z_3&=Y_3+rnorm(300,\ 0,\ \ 0.2),\ \ \ \text{and}\\ z_3&=Z_3+rnorm(300,\ 2,\ 5).\end{align*}
Stratum 4
\begin{align*} X_4&=rnorm(350,\ 500,\ 8),\\ x_4&=X_4+rnorm(350,\ 2,\ 5),\\ Y_4&=0.8+0.5X_4+rnorm(350,\ 0,\ 1),\\ Z_4&=Y_4+rnorm(350,\ 0,\ \ 0.2),\ \ \ \text{and}\\ z_4&=Z_4+rnorm(350,\ 2,\ 5).\end{align*}
Population II: Sarndal et al. [18]
The population consists of five strata of sizes \(N_1=38,\ N_2=14,\ N_3=11,\ N_4=33\) and \(N_5=24\). Table 1 presents summary statistics for populations I and II.
Table 1. Parameters for populations I and II.
| Population | Stratum | \({\overline{X}}_h\) | \({\overline{Z}}_h\) | \(S^2_{Xh}\) | \(S^2_{Zh}\) | \({\rho }_{XZh}\) | \(S^2_{Th}\) | \(S^2_{Vh}\) |
|---|---|---|---|---|---|---|---|---|
| I | 1 | 450.2457 | 227.7285 | 227.9771 | 81.01574 | 0.8406767 | 22.31754 | 20.40296 |
| | 2 | 577.5290 | 291.1661 | 3583.724 | 929.5202 | 0.9824869 | 30.78505 | 27.55788 |
| | 3 | 921.7221 | 463.7038 | 643.6014 | 212.0282 | 0.9236006 | 30.68011 | 25.67958 |
| | 4 | 499.8988 | 252.5883 | 61.48334 | 46.27591 | 0.613076 | 28.21903 | 22.29580 |
| II | 1 | 1029.158 | 16.09219 | 3667896 | 327.0976 | 0.7177369 | 30.54519 | 22.30025 |
| | 2 | 25671.57 | 29.88566 | 6568461403 | 3617.208 | 0.9645813 | 26.56678 | 25.47327 |
| | 3 | 5028.818 | 28.29478 | 63348743 | 1493.623 | 0.979968 | 22.46011 | 19.04889 |
| | 4 | 7533.939 | 82.67373 | 440717912 | 45688.17 | 0.3021371 | 29.60195 | 17.35155 |
| | 5 | 16315.25 | 22.62072 | 408441212 | 405.7601 | 0.8939683 | 16.27989 | 21.05136 |
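The generating recipe for stratum 1 of population I can be mirrored outside R as a rough sketch. It uses Python's standard library in place of `rnorm`, whose third argument is a standard deviation, so the measurement errors below have standard deviation 5 (variance 25), which is in line with the magnitudes of \(S^2_{Th}\) and \(S^2_{Vh}\) in Table 1. The seed and the helper name are arbitrary choices made here, so the draws will not reproduce the table entries exactly.

```python
import random
import statistics

def simulate_stratum(n, mean_x, sd_x, rng):
    """Mirror the rnorm() recipe for one stratum of population I."""
    X = [rng.gauss(mean_x, sd_x) for _ in range(n)]        # true auxiliary values
    x = [Xi + rng.gauss(2, 5) for Xi in X]                 # observed auxiliary values
    Y = [0.8 + 0.5 * Xi + rng.gauss(0, 1) for Xi in X]     # true sensitive values
    Z = [Yi + rng.gauss(0, 0.2) for Yi in Y]               # scrambled responses
    z = [Zi + rng.gauss(2, 5) for Zi in Z]                 # observed scrambled responses
    return X, x, Y, Z, z

rng = random.Random(1)  # arbitrary seed, not taken from the paper
X, x, Y, Z, z = simulate_stratum(100, 450, 15, rng)

meas_err_z = [zi - Zi for zi, Zi in zip(z, Z)]             # measurement error on Z
print("mean of X:", round(statistics.fmean(X), 2))
print("mean of observed z:", round(statistics.fmean(z), 2))
print("variance of measurement error on Z:", round(statistics.variance(meas_err_z), 2))
```

The same function can be called with the sizes, means and standard deviations listed for the other strata to build the full simulated population.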
7.2. Discussion
Tables 2 and 3 show the contribution of measurement errors and the Randomized Response Technique (RRT) to the bias, Mean Squared Error (MSE), and Percent Relative Efficiency (PRE) of the mean estimators. The numerical study shows that the MSE of every estimator is lower without measurement errors and increases when measurement errors are introduced into the survey. Moreover, the PRE of the mean estimators decreases when measurement errors are present. The proposed generalized estimator has the smallest bias for population II and, for population I, a bias close to the smallest in absolute value. A key finding of the study is that the proposed estimator has the smallest MSE among the estimators considered, both with and without measurement errors, for the real and the simulated data.
Table 2. Bias (in brackets), MSE and PRE of estimators for population I.
| Estimator | MSE (without measurement errors) | PRE (without measurement errors) | MSE (with measurement errors) | PRE (with measurement errors) |
|---|---|---|---|---|
| \(t_0\) | 0.1389270 | 100 | 0.1964786 | 100 |
| \(t_g\) | 0.0620388 (0.0001302) | 223.8805 | 0.1075719 (0.0001629) | 182.6487 |
| \(t_r\) | 0.0701600 (0.0001325) | 198.0145 | 0.1118722 (0.0001789) | 175.6281 |
| \(t_{err}\) | 0.0817005 (0.0003902) | 170.0022 | 0.1322906 (0.0004957) | 148.5205 |
| \(t_{erp}\) | 0.0811046 (-0.0001238) | 171.2513 | 0.1291244 (-0.0001379) | 152.1623 |
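The PRE entries can be checked directly from the MSE entries, taking the PRE of an estimator to be 100 times the MSE of \(t_0\) divided by the MSE of that estimator. The sketch below uses the second MSE column of Table 2 for \(t_0\) and \(t_g\); the function name is a hypothetical helper introduced for illustration.

```python
def percent_relative_efficiency(mse_reference, mse_estimator):
    """PRE of an estimator relative to the ordinary mean estimator t_0."""
    return 100.0 * mse_reference / mse_estimator

# Values copied from the second MSE column of Table 2.
mse_t0 = 0.1964786
mse_tg = 0.1075719

pre_tg = percent_relative_efficiency(mse_t0, mse_tg)
print(round(pre_tg, 2))
```

The result agrees with the tabulated PRE of about 182.65 for \(t_g\), and by construction the reference estimator \(t_0\) always has a PRE of 100.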
Table 3. Bias (in brackets), MSE and PRE of estimators for population II.
| Estimator | MSE (without measurement errors) | PRE (without measurement errors) | MSE (with measurement errors) | PRE (with measurement errors) |
|---|---|---|---|---|
| \(t_0\) | 62.54516 | 100 | 63.04892 | 100 |
| \(t_g\) | 58.18433 (0.124300) | 107.4949 | 58.95167 (0.132101) | 106.9502 |
| \(t_r\) | 76.12109 (1.721466) | 82.16552 | 78.65928 (1.929885) | 80.15446 |
| \(t_{err}\) | 110.9480 (3.065028) | 56.36996 | 116.9480 (3.414083) | 53.91193 |
| \(t_{erp}\) | 59.98440 (0.377905) | 104.2690 | 60.83360 (0.4456877) | 103.6416 |
7.3. Conclusion
The study proposes a generalized estimator of the finite population mean for a sensitive variable using a non-sensitive auxiliary variable in the presence of measurement errors based on the Randomized Response Technique (RRT). Expressions for the bias and Mean Squared Error (MSE) for the proposed estimator have been derived up to the first order of approximation. The performance of the proposed estimator has been studied both theoretically and numerically. The numerical study reveals that the presence of measurement errors in a survey based on the Randomized Response Technique (RRT) increases the variance and Mean Squared Error (MSE) resulting in biased estimates of the finite population mean. Finally, the proposed strategy is applicable in surveys involving sensitive variables such as bribery, cheating in examination, drug abuse, homosexuality, habitual tax evasion, reckless driving, abortion, indiscriminate gambling among others.
Acknowledgments
Authors are thankful to the anonymous referee for his constructive comments and feedback.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Warner, S. L. (1965). Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309), 63-69.
- Shalabh. (1997). Ratio method of estimation in the presence of measurement errors. Journal of Indian Society of Agricultural Statistics, 50(2), 150-155.
- Diwakar, S., Sharad, P., & Narendra, S. T. (2012). An Estimator for Mean Estimation in Presence of measurement error. Research and Reviews: A Journal of Statistics, 1(1), 1-8.
- Yadav, D., Sheela, M., & Dipika. (2017). Estimation of population mean using auxiliary information in presence of measurement errors. International Journal of Engineering Sciences and Research Technology, 6(6), DOI: 10.5281/zenodo.817860.
- Gajendra, K. V., Abhishek, S., & Neha, S. (2020). Calibration under measurement errors. Journal of King Saud University Science, 32(7), 2950-2961.
- Eichhorn, B. H., & Hayre, L. S. (1983). Scrambled randomized response methods for obtaining sensitive quantitative data. Journal of Statistical Planning and Inference, 7, 307-316.
- Gupta, S., & Shabbir, J. (2004). Sensitivity estimation for personal interview survey questions. Statistica, 64(4), 643-653.
- Gupta, S., Shabbir, J., & Sehra, S. (2010). Mean and sensitivity estimation in optional randomized response models. Journal of Statistical Planning and Inference, 140(10), 2870-2874.
- Sousa, R., Shabbir, J., Real, P. C., & Gupta, S. (2010). Ratio estimation of the mean of a sensitive variable in the presence of auxiliary information. Journal of Statistical Theory and Practice, 4(3), 495-507.
- Tanveer, A. T., & Housila, P. S. (2015). A general procedure for estimating the mean of a sensitive variable using auxiliary information. Revista Investigacion Operacional, 36(3), 268-279.
- Mushtaq, N., Noor-Ul-Amin, M., & Hanif, M. (2017). A family of estimators of a sensitive variable using auxiliary information in stratified random sampling. Pakistan Journal of Operation Research, 13(1), 141-155.
- Mushtaq, N., Noor-Ul-Amin, M., & Hanif, M. (2016). Estimation of population mean of a sensitive variable in stratified two-phase sampling. Pakistan Journal of Statistics, 32, 393-404.
- Mushtaq, N., & Noor-Ul-Amin, M. (2020). Joint influence of double sampling and randomized response technique on estimation method of mean. Applied Mathematics, 10(1), 12-19.
- Naeem, N., & Shabbir, J. (2018). Use of a scrambled response on two occasion's successive sampling under nonresponse. Hacettepe Journal of Mathematics and Statistics, 47(3), 675-684.
- Zahid, E., & Shabbir, J. (2019). Estimation of finite population mean for a sensitive variable using dual auxiliary information in the presence of measurement errors. PLoS ONE, 14(2), e0212111.
- Sadia, K. (2017). Generalized mean estimators for sensitive and non-sensitive variables in the presence of measurement errors. PhD Thesis, National College of Business Administration and Economics, Lahore.
- Zhang, Q. (2020). Mean estimation of sensitive variables under measurement errors and non-response. PhD Thesis, The University of North Carolina at Greensboro.
- Sarndal, C., Swensson, B., & Wretman, J. (1992). Model assisted survey sampling. New York: Springer.