|Year : 2020 | Volume
| Issue : 3 | Page : 194-198
Methods to Handle Incomplete Data
Vinny Johny1, Mariamma Philip2, Swathi Augustine1
1 Department of Community Medicine, Pushpagiri Institute of Medical Sciences and Research Centre, Kerala
2 Department of Biostatistics, National Institute of Mental Health and Neuro Sciences, The Tamil Nadu Dr MGR Medical University, Chennai
|Date of Submission||06-Jun-2020|
|Date of Decision||17-Aug-2020|
|Date of Acceptance||25-Aug-2020|
|Date of Web Publication||16-Dec-2020|
Biostatistician, Department of Community Medicine, College of Medicine, Pushpagiri Institute of Medical Sciences and Research Centre, Kerala, India.
Source of Support: None, Conflict of Interest: None
Context: A major question in data analysis is how to choose an appropriate analytic approach when observations are incomplete. The most common solution to handle missing data in a data set is imputation, where missing values are estimated and filled in. An important problem in imputation is preserving the statistical distribution of the data set. Aim: To compare different techniques for handling missing data: complete case analysis, last observation carried forward (LOCF), mean imputation, hot deck imputation, regression imputation, and multiple imputation (MI). Settings and Design: The data for the study were collected from a prospective study of the predictors of early response to treatment in drug-naïve schizophrenia patients at a tertiary care centre in India. Methods and Material: The present study compares six methods (complete case analysis, LOCF, mean imputation, hot deck imputation, regression imputation, and MI) for filling in the missing values of the outcome variable. Statistical analysis used: The paired t test was used to compare the imputation methods. Results: At the fourth week, the positive and negative syndrome scale scores were missing for about 41% of the subjects. Mean imputation differed significantly from LOCF (P = 0.001), regression imputation (P = 0.010), and MI (P = 0.002). LOCF also differed significantly from regression imputation (P = 0.001), hot deck imputation (P = 0.011), and MI (P = 0.001). Conclusions: LOCF and mean imputation differ from the other imputation methods, and there is no difference between hot deck imputation, MI, and regression imputation.
Keywords: Imputation methods, mean imputation, multiple imputation, regression imputation
|How to cite this article:|
Johny V, Philip M, Augustine S. Methods to Handle Incomplete Data. MAMC J Med Sci 2020;6:194-8
Key Messages: Missing data are a common issue in many scientific investigations. The most common solution to handle missing data in a data set is imputation. Commonly used methods are complete case analysis, last observation carried forward, mean imputation, hot deck imputation, regression imputation, and multiple imputation.
| Introduction|| |
Missing values, common in surveys, clinical trials, and epidemiologic studies, are a major obstacle to obtaining valid estimates. The term missing data implies that the information available about the phenomenon of interest is incomplete. Missing data are simply observations that one intended to make but did not, and in most studies they are unavoidable.
Many problems lead to missing values in a data set. Missing values mostly occur as a result of manual data entry procedures, equipment errors, incorrect measurements, and so on. They cause loss of efficiency and complicate the handling and analysis of the data. They can also introduce bias that leads to misleading inferences about the parameters.
The missing data mechanism, that is, the underlying process that generates the missing data, plays an important role in the choice of a method for handling missing values. Rubin introduced three missing data mechanisms: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). The mechanism is MCAR if the missingness is unrelated to the values of any variables, whether missing or observed. Data that are missing because a researcher dropped the test tubes or a survey participant accidentally skipped questions are likely to be MCAR. The mechanism is MAR if the missingness is related to observed data (covariates or observed responses) but not to the missing values themselves, whereas the mechanism is MNAR if the missingness is related to the missing values. MNAR commonly occurs when people do not want to reveal something very personal or unpopular about themselves.
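The three mechanisms can be made concrete with a small simulation. The sketch below is only illustrative: the variables `age` and `score`, the cut-offs, and the missingness probabilities are all hypothetical choices and are not taken from the study data.

```python
import random

def make_missing(rows, mechanism, rng):
    """Return a copy of rows with some 'score' values set to None.

    rows: list of dicts with keys 'age' (always observed) and 'score'.
    mechanism: 'MCAR', 'MAR', or 'MNAR' (illustrative rules only).
    """
    out = []
    for row in rows:
        if mechanism == "MCAR":
            # Same probability of missingness for everyone.
            p = 0.3
        elif mechanism == "MAR":
            # Probability depends only on an observed covariate (age).
            p = 0.5 if row["age"] >= 40 else 0.1
        elif mechanism == "MNAR":
            # Probability depends on the value that goes missing.
            p = 0.5 if row["score"] >= 100 else 0.1
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        score = None if rng.random() < p else row["score"]
        out.append({"age": row["age"], "score": score})
    return out

def observed_mean(data):
    """Mean of the scores that are still observed."""
    values = [r["score"] for r in data if r["score"] is not None]
    return sum(values) / len(values)

rng = random.Random(0)
# Hypothetical population: ages 18 to 60, scores roughly normal around 90.
rows = [{"age": rng.randint(18, 60), "score": rng.gauss(90, 20)}
        for _ in range(2000)]

full_mean = sum(r["score"] for r in rows) / len(rows)
mcar_mean = observed_mean(make_missing(rows, "MCAR", rng))
mnar_mean = observed_mean(make_missing(rows, "MNAR", rng))
```

Under the MCAR rule the mean of the observed scores stays close to the full-data mean, whereas under the MNAR rule (high scores go missing more often) the observed mean is pulled downward. This is the kind of bias the MNAR definition warns about.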
A statistical procedure for analyzing longitudinal data with missing values should be selected based on the mechanism of missingness and the amount of missingness present, so no single method is best in all situations. The most common solution to handle missing data in a data set is imputation, where missing values are estimated and filled in. The main problem in imputation is preserving the statistical distribution of the data set. Imputation is a general and flexible method for handling missing data problems, but it is not without pitfalls. The main purpose of this study is to describe different techniques such as complete case analysis, mean imputation, last observation carried forward (LOCF), regression imputation, hot deck imputation, and multiple imputation (MI) and to compare the performance of these methods.
| Subjects and Methods|| |
A variety of approaches to handling missing data have emerged over the last few years. Below is a list of some of them, along with a brief description.
Complete case analysis
Complete case analysis is also known as “listwise deletion (LD)” or “case-wise deletion”. It is the simplest and the most common way to handle incomplete cases: only cases with complete data are analyzed. It is the default option in most statistical software. There are definite advantages to complete case analysis. If the data are MCAR, the complete cases are a random subsample of the original sample, and the parameter estimates are unbiased. The only loss is to statistical power, and in many situations this is not a particularly important consideration because this type of study often has a high level of power to begin with. If the data are MNAR, this approach produces biased estimates, and because the results are confounded with missingness, the resulting model is very difficult to interpret. Two problems arise with complete case analysis: (i) If the missing values differ systematically from the completely observed cases, this could bias the complete case analysis. (ii) If many variables are included in a model, there may be very few complete cases, so that most of the data would be discarded for the sake of a simple analysis.
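As a minimal sketch of listwise deletion, the following keeps only the records with no missing field. The record layout (`id`, `week1`, `week2`, `week4`) and the values are hypothetical and serve only to show how quickly cases are lost.

```python
def complete_cases(records):
    """Keep only the records in which no field is None (listwise deletion)."""
    return [r for r in records if all(v is not None for v in r.values())]

# Hypothetical weekly scores for four subjects; None marks a missing visit.
records = [
    {"id": 1, "week1": 95, "week2": 88, "week4": 70},
    {"id": 2, "week1": 102, "week2": None, "week4": 85},
    {"id": 3, "week1": 80, "week2": 76, "week4": None},
    {"id": 4, "week1": 91, "week2": 84, "week4": 66},
]

kept = complete_cases(records)
# Any analysis now uses only subjects 1 and 4.
week4_mean = sum(r["week4"] for r in kept) / len(kept)
```

Subject 2 is dropped even though the week 4 score was observed, which illustrates problem (ii): a case is discarded as soon as any one of the analyzed variables is missing.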
Last observation carried forward
One of the most widely used imputation methods in longitudinal analysis is the LOCF method. In this method, every missing value is replaced by the last observed value from the same subject; that is, it extrapolates the last observed measurement, and it can be applied to both continuous and discrete variables. The method is subject to bias, especially if the “missingness” and measurement processes are related.
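LOCF for one subject's series of visits can be sketched in a few lines. The function name `locf` and the example scores are illustrative assumptions; leading missing values are left as they are because there is no earlier observation to carry forward.

```python
def locf(series):
    """Replace each None with the most recent observed value before it.

    Leading None values stay None: nothing has been observed yet.
    """
    filled = []
    last = None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled

# Hypothetical weekly scores for one subject who dropped out after week 2.
weekly_scores = [92, 85, None, None]
completed = locf(weekly_scores)
```

For this subject the week 2 score of 85 is repeated for weeks 3 and 4, so any improvement or worsening after dropout is assumed away.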
Mean imputation
Single-value imputation was once the most common imputation method. For a missing value, it imputes a single value; that is, a missing value is replaced by a single value (e.g., mean or median). The most commonly used single value is the mean; hence, this method is also known as mean substitution. This is the easiest method for imputing missing values. In this method, the missing values are replaced by the mean of the observed values for that variable. The main advantage is that it is very simple to implement for any type of variable.
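A minimal sketch of mean substitution follows, using hypothetical scores. It also computes the sample variance before and after imputation, since filling every gap with the same value necessarily shrinks the spread of the variable.

```python
from statistics import mean, variance

def mean_impute(values):
    """Replace each None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in values]

# Hypothetical scores with two missing entries.
scores = [70, 80, None, 90, None, 100]
observed = [v for v in scores if v is not None]
imputed = mean_impute(scores)

var_observed = variance(observed)  # sample variance of the observed values
var_imputed = variance(imputed)    # smaller: imputed points add no spread
```

The mean of the variable is unchanged by construction, but the variance after imputation is smaller than the variance of the observed values.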
Regression imputation
In regression imputation, a regression model predicting the variable with missing values is estimated from the cases in the sample with complete information. An ordinary least squares regression model can be fitted to the data, and the predicted scores computed from the regression coefficients then replace the missing values wherever an observation is missing. When there is a strong relationship between the variable that has missing observations and other independent variables, regression substitution is thought to work reasonably well. Regression imputation will increase the correlation among items because some of the items will have been explicitly calculated as a linear function of other items. This will affect the regression coefficients that result from the analysis. The imputed values would also be expected to have less error than the values would have had if they had been observed, because they fall exactly on the regression line.
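The procedure can be sketched with a single predictor. This is a simplified illustration, not the model used in the study: the predictor `baseline`, the outcome `week4`, and the values are hypothetical, and only one covariate is used so that the least-squares line can be computed by hand.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def regression_impute(x, y):
    """Fill None entries of y with predictions from the complete pairs."""
    complete = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    a, b = fit_simple_ols([c[0] for c in complete],
                          [c[1] for c in complete])
    return [a + b * xi if yi is None else yi for xi, yi in zip(x, y)]

# Hypothetical baseline scores (fully observed) and week 4 scores.
baseline = [80, 90, 100, 110, 120]
week4 = [60, 70, None, 90, None]
filled = regression_impute(baseline, week4)
```

In this toy data the complete pairs lie on the line week4 = baseline - 20, so the two missing values are filled with approximately 80 and 100. Every imputed value sits exactly on the fitted line, which is why the method inflates correlations and understates error.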
Hot deck imputation
One of the earliest methods of imputing missing values is hot deck imputation. This method searches for other respondents that have the same response patterns over matching variables. Matching variables are variables that are related to or predictive of the variables for which missing values are to be imputed. If a matching respondent is found for a nonrespondent, the respondent’s value is donated to the nonrespondent; that is, missing values of nonrespondents are replaced by the values of the matching respondent. This technique does not require distributional assumptions. In a way, this is seen as an improvement on the mean imputation, as this method creates more variability in the imputed values.
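One simple variant of this idea is a random hot deck with exact matching, sketched below. The matching variables (`sex`, `age_group`), the field names, and the records are hypothetical, and the choice of a random donor among all exact matches is an assumption; other hot deck schemes pick the nearest or the most recent donor instead.

```python
import random

def hot_deck_impute(records, target, match_on, rng):
    """Fill missing `target` values with the value of a randomly chosen
    donor whose `match_on` variables equal the recipient's.

    Records with no matching donor are left unchanged.
    """
    # Build the pool of donor values for every matching pattern.
    donors = {}
    for r in records:
        if r[target] is not None:
            key = tuple(r[k] for k in match_on)
            donors.setdefault(key, []).append(r[target])
    out = []
    for r in records:
        r = dict(r)  # copy so the input records are not modified
        if r[target] is None:
            pool = donors.get(tuple(r[k] for k in match_on))
            if pool:
                r[target] = rng.choice(pool)
        out.append(r)
    return out

# Hypothetical respondents and nonrespondents.
records = [
    {"sex": "M", "age_group": "18-30", "score": 72},
    {"sex": "M", "age_group": "18-30", "score": 68},
    {"sex": "F", "age_group": "31-45", "score": 81},
    {"sex": "M", "age_group": "18-30", "score": None},
    {"sex": "F", "age_group": "46-60", "score": None},
]
result = hot_deck_impute(records, "score", ["sex", "age_group"],
                         random.Random(1))
```

The fourth record receives either 72 or 68 from one of its two matching donors, while the fifth record stays missing because no respondent shares its pattern. Because donated values are real observed values, the imputed data keep more variability than mean imputation would.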
Multiple imputation
MI is a statistical technique for analyzing incomplete data sets; that is, data sets for which some entries are missing. Missing observations can be continuous or discrete. MI is a Monte Carlo technique in which the missing values are replaced by m > 1 simulated versions, where m is typically small (usually 3–10). In MI, we replace each missing item by two or more acceptable values representing a distribution of possibilities. The advantage of this method is that once the imputed data sets have been generated, the analysis can be carried out using procedures in virtually any statistical package. However, there are some disadvantages. Individuals with missing data are not allowed to have distinct probabilities, which means that individual variation is ignored. Furthermore, within each completed data set the analysis does not distinguish between observed and imputed values, so the uncertainty inherent in the missing data is reflected only when the results from the m data sets are pooled. The MI technique requires three steps: imputation, analysis, and pooling [Figure 1].
- Imputation: Impute (fill in) the missing entries of the incomplete data set, not once, but m times. Imputed values are drawn from a distribution. This step results in m complete data sets.
- Analysis: Analyze each of the m completed data sets. This step results in m analyses.
- Pooling: Integrate the m analysis results into a final result. Simple rules exist for combining the m analyses.
Proper use of MI requires three things:
- Impose a probability model on the complete data.
- Specify a prior distribution for the parameters of the imputation model.
- Assume an ignorable missing mechanism (MCAR/MAR).
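The imputation, analysis, and pooling steps described above can be sketched as follows, with the pooling done by Rubin's rules. This is a deliberately simplified illustration and not the procedure used in the study: the imputation model is a normal distribution fitted to the observed values, and the parameters of that model are treated as fixed rather than drawn from a posterior distribution, which a proper MI would do. The function names, the data, and m = 5 are assumptions.

```python
import random
from statistics import mean, stdev, variance

def impute_once(values, rng):
    """Imputation step: fill each None with a random draw from a normal
    distribution fitted to the observed values (simplifying assumption)."""
    observed = [v for v in values if v is not None]
    mu, sigma = mean(observed), stdev(observed)
    return [rng.gauss(mu, sigma) if v is None else v for v in values]

def analyse(completed):
    """Analysis step: the estimate (mean) and the variance of that estimate."""
    return mean(completed), variance(completed) / len(completed)

def pool(results):
    """Pooling step (Rubin's rules) for a list of (estimate, variance)."""
    m = len(results)
    estimates = [q for q, _ in results]
    q_bar = mean(estimates)                  # pooled point estimate
    within = mean([u for _, u in results])   # within-imputation variance
    between = variance(estimates)            # between-imputation variance
    total = within + (1 + 1 / m) * between   # total variance
    return q_bar, total

# Hypothetical scores with three missing entries.
scores = [92, 85, None, 70, None, 101, 88, None, 79, 95]
rng = random.Random(42)
m = 5
results = [analyse(impute_once(scores, rng)) for _ in range(m)]
pooled_mean, pooled_var = pool(results)
```

The between-imputation term is what carries the extra uncertainty due to the missing values: the more the m estimates disagree, the larger the pooled variance becomes relative to the variance from any single completed data set.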
Imputation is an attractive approach to analyzing incomplete data. However, an unprincipled imputation method may create more problems than it solves. With the advent of new software, MI has become increasingly attractive for researchers in the biomedical, behavioural, and social sciences whose investigations are hindered by missing data. MI is neither the only principled method nor necessarily the best method for handling missing values; good estimates can also be obtained through weighted estimation procedures.
| Methods|| |
This article aims to compare different techniques for handling missing data: complete case analysis, LOCF, mean imputation, hot deck imputation, regression imputation, and MI. The data for the study were collected from a prospective study designed to find out the predictors of early response to treatment in drug-naïve schizophrenia patients. These data contained information about 111 drug-naïve schizophrenia patients. Patients of both sexes with a DSM-IV (Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition) diagnosis of schizophrenia who had never received any psychotropic medication and who consented to be treated in the hospital were recruited into the study. Patients who had comorbid substance abuse, organic brain disorders, or mental retardation were excluded from the study. The Positive and Negative Syndrome Scale (PANSS) score was the outcome of the study. PANSS is an established instrument for measuring the severity of schizophrenia, and it was measured at weekly intervals. It is a 30-item, 7-point rating instrument that evaluates positive, negative, and other symptom dimensions on the basis of a formal semi-structured clinical interview and other information sources spanning over 45 minutes. From this, scores on three subscales (positive, negative, and general psychopathology) can be derived.
A paired t test was used for the comparison of imputation techniques. Data were analyzed using STATA (8) and SPSS (14). The level of significance was set at 0.05.
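For readers who want to see the arithmetic, the paired t statistic can be computed as below. The sketch assumes that each pair consists of the values two methods produce for the same subject; the two lists are hypothetical and are not the study data. The P value would then be read from a t distribution with n - 1 degrees of freedom, which is left to a statistical package.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(a, b):
    """Return (t statistic, degrees of freedom) for paired samples a and b."""
    if len(a) != len(b):
        raise ValueError("paired samples must have the same length")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    # Mean difference divided by its standard error.
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1

# Hypothetical week 4 scores for five subjects under two methods.
method_a = [70, 82, 65, 90, 77]
method_b = [68, 85, 60, 88, 70]
t_stat, df = paired_t(method_a, method_b)
```

The test works on the within-subject differences, so it asks whether the two methods disagree systematically for the same subjects rather than whether their overall means differ by chance.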
| Results|| |
The study sample consisted of 111 drug-naïve schizophrenia patients. The mean (SD) age of the study sample was 31.17 (8.5) years, and 51% were males. The mean (SD) PANSS score at the first week was 92.24 (20.99).
[Table 1] shows that the rate of missingness increased during the weeks of follow-up, and about 41% of the subjects had missing PANSS scores at the fourth week assessment.
The comparison of complete case analysis with the other techniques using total PANSS scores is depicted in [Table 2]. There was no mean difference between mean imputation and complete case analysis, whereas LOCF had the largest mean difference (6.63).
[Table 3] describes the comparison of different imputation techniques. Mean imputation differed significantly from LOCF (P = 0.001), regression imputation (P = 0.010), and MI (P = 0.002). Mean imputation and hot deck imputation produced similar results (P = 0.820). Also, regression imputation did not have a significant difference from hot deck imputation (P = 0.146) or MI (P = 0.593) and hot deck imputation and MI also did not differ significantly (P = 0.105).
However, LOCF differed significantly from all these methods − regression imputation (P = 0.001), hot deck imputation (P = 0.011), and MI (P = 0.001).
| Discussion|| |
The performance of complete case analysis in the present study, assessed with respect to missing data measured as a continuous variable, was consistent with the findings from the review of literature. The literature on complete case analysis generally suggests that the method is safe but gives some bias.
Perhaps the simplest method is mean imputation, but many researchers have suggested that it produces biased estimates of variance and should generally be avoided. In the present study, the performance of mean imputation, assessed with respect to missing data measured as a continuous variable, was consistent with previous findings; that is, mean imputation differed from the other imputation methods.
LOCF analysis uses all subjects and imputes the missing values with the last observed values, a method that assumes that the outcomes would not have changed from the last observed value. In this study, LOCF differed from the other imputation methods. LOCF was not particularly conservative and gave relatively low power and high bias.
Many researchers have suggested that regression imputation produces estimates closest to the original variables, and in the present study the performance of regression imputation, assessed with respect to missing data measured as a continuous variable, was consistent with these findings.
The performance of the model-based procedure MI was evaluated, as explained earlier, with the ice algorithm. The review of literature suggested that MI produces consistent estimates and helps to decrease the bias in the data. In the present study, the estimates produced by MI were closest to the original data.
| Conclusion|| |
In the present study, missing values were handled using complete case analysis, LOCF, mean imputation, regression imputation, hot deck imputation, and MI, and the results of these techniques were compared. We conclude that LOCF and mean imputation differ from the other imputation methods and that there is no difference between hot deck imputation, MI, and regression imputation.
We express our deep sense of gratitude to the Department of Biostatistics, National Institute of Mental Health and Neuro-Sciences, Bangalore, India for all the support.
Financial support and sponsorship
Conflicts of Interest
There are no conflicts of interest.
| References|| |
Rubin DB, Little RJA. Statistical Analysis with Missing Data. 2nd ed. New York: John Wiley & Sons; 2002. p. 1-23.
Pigott TD. A review of methods for missing data. Educ Res Eval 2001;7:353-83.
Musil CM, Warner CB, Yobas PK, Jones SL. A comparison of imputation techniques for handling missing data. West J Nurs Res 2002;24:815-29.
Wood AM. Comparison of imputation and modelling methods in the analysis of a physical activity trial with missing outcomes. Int J Epidemiol 2004;34:89-99.
Nakai M, Ke W. Review of the methods for handling missing data in longitudinal data analysis. Int J Math Anal 2011;5:1-13.
Little RJA, Rubin DB. The analysis of social science data with missing values. Social Methods Res 1989;18:292-326.
Pérez A, Dennis RJ, Gil JFA, Rondón MA, López A. Use of the mean, hot deck and multiple imputation techniques to predict outcome in intensive care unit patients in Colombia. Stat Med 2002;21:3885-96.
Little RJA, Rubin DB. Statistical Analysis with Missing Data. 3rd ed. New York: John Wiley & Sons; 2019. p. 76.
He Y. Missing data analysis using multiple imputation: getting to the heart of the matter. Circ Cardiovasc Qual Outcomes 2010;3:98-105.
Gmel G. Imputation of missing values in the case of a multiple item instrument measuring alcohol consumption. Stat Med 2001;20:2369-81.
Kay SR, Fiszbein A, Opler LA. The positive and negative syndrome scale (PANSS) for schizophrenia. Schizophr Bull 1987;13:261-76.
Fielding S, Fayers PM, Loge JH, Jordhøy MS, Kaasa S. Methods for handling missing data in palliative care research. Palliat Med 2006;20:791-8.
Fernández-Alonso R, Suárez-Álvarez J, Muñiz J. [Imputation methods for missing data in educational diagnostic evaluation]. Psicothema 2012;24:167-75.
Groenwold RHH, Donders ART, Roes KCB, Harrell FE, Moons KGM. Dealing with missing outcome data in randomized trials and observational studies. Am J Epidemiol 2012;175:210-7.
Olsen IC, Kvien TK, Uhlig T. Consequences of handling missing data for treatment response in osteoarthritis: a simulation study. Osteoarthr Cartil 2012;20:822-8.
Olsen MK, Stechuchak KM, Edinger JD, Ulmer CS, Woolson RF. Move over LOCF: principled methods for handling missing data in sleep disorder trials. Sleep Med 2012;13:123-32.
Twisk J, de Boer M, de Vente W, Heymans M. Multiple imputation of missing values was not necessary before performing a longitudinal mixed-model analysis. J Clin Epidemiol 2013;66:1022-8.
[Table 1], [Table 2], [Table 3]