Longitudinal and Life Course Studies
An international journal

Letter to the editor: Don’t forget survey data: ‘healthy cohorts’ are ‘real-world’ relevant if missing data are handled appropriately

  • 1 University College London, UK
Open access

Dear Professor Joshi,

We write regarding the published article ‘Are “healthy cohorts” real-world relevant? Comparing the National Child Development Study (NCDS) with the ONS Longitudinal Study (LS)’ by Archer et al (2020). The authors report that NCDS is unrepresentative of age-matched LS respondents but that, despite differences in sample characteristics, longitudinal associations were similar in the NCDS and LS samples. They attribute the discrepancy between NCDS and LS to a ‘healthy cohort’ effect and propose that non-response weights derived from administrative data should be used. While we agree with Archer et al that administrative data have the potential to inform missing data analyses in longitudinal surveys, the authors do not mention that, even without administrative data, methods are already available to researchers that restore sample representativeness using survey information alone, and that these methods have been shown to be highly effective.

To demonstrate the effectiveness of using survey information – without augmentation by administrative data – in restoring sample representativeness in NCDS with respect to the LS, we reproduce Table 1 from their manuscript with additional columns from our own analyses. We accounted for non-response at ages 46 and 55 with multiple imputation (MI), using chained equations (Azur et al, 2011; White et al, 2011; Harel et al, 2018) to generate 50 imputed datasets.¹ The imputation phase included ‘auxiliary variables’ (Carpenter and Kenward, 2012) from earlier sweeps of NCDS that were associated with both non-response at ages 46 and 55 and the outcome of interest (long-term limiting illness, for example), as well as variables known to be associated only with the outcome of interest.²
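For readers less familiar with the analysis phase of MI, the sketch below shows how a prevalence estimated separately in each imputed dataset might be combined using Rubin's rules. It is an illustration only, not the code used for the analyses reported here: the function name `pool_prevalence`, the three example estimates and the sample size are hypothetical, and a normal approximation is used for the interval where a t reference distribution would be more usual.

```python
import math
from statistics import NormalDist, mean, variance


def pool_prevalence(estimates, within_variances, level=0.95):
    """Combine per-dataset prevalence estimates with Rubin's rules.

    estimates: one point estimate per imputed dataset.
    within_variances: the sampling variance of each of those estimates.
    Returns (pooled estimate, total variance, (lower, upper)).
    """
    m = len(estimates)
    q_bar = mean(estimates)          # pooled point estimate
    w_bar = mean(within_variances)   # average within-imputation variance
    b = variance(estimates)          # between-imputation variance (m - 1 denominator)
    total = w_bar + (1 + 1 / m) * b  # Rubin's total variance
    # Simplifying assumption: normal quantile instead of Rubin's t distribution.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * math.sqrt(total)
    return q_bar, total, (q_bar - half_width, q_bar + half_width)


# Hypothetical prevalences from three imputed datasets of 8,000 people each.
example_estimates = [0.224, 0.226, 0.228]
example_variances = [p * (1 - p) / 8000 for p in example_estimates]
pooled, total_var, interval = pool_prevalence(example_estimates, example_variances)
print(f"pooled prevalence: {pooled:.3f}")
```

The total variance adds the between-imputation component to the average within-imputation variance, which is why intervals based on imputed data reflect the uncertainty due to the missing values rather than treating imputed values as observed.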

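Table 1: Sample characteristics (prevalence and 95% confidence interval unless otherwise stated)

| Characteristic | NCDS 2004 (age 46), n = 8,689 (a): Archer et al Table 1 | NCDS 2004: our calculations, observed data | NCDS 2004: our calculations, MI (b) | ONS LS 2001 (age 45): Archer et al Table 1, n = 7,157 (c) | ONS LS 2001: Archer et al Table S4, n = 6,393 (d) | NCDS 2013 (age 55), n = 8,107 (a): Archer et al Table 1 | NCDS 2013: our calculations, observed data | NCDS 2013: our calculations, MI (b) | ONS LS 2011 (age 55): Archer et al Table 1, n = 7,052 (c) | ONS LS 2011: Archer et al Table S4, n = 6,170 (d) |
|---|---|---|---|---|---|---|---|---|---|---|
| Long-term limiting illness | | | | | | | | | | |
| Yes | | | | 14.9 | 15.0 | 19.7 | 19.7 (18.8, 20.6) | 22.6 (21.5, 23.6) | 22.8 | 22.5 |
| No | | | | 85.1 | 85.0 | 80.3 | 80.3 (79.4, 81.2) | 77.4 (76.4, 78.5) | 77.2 | 77.5 |
| Missing (n) | | | | 141 | 99 | 115 | 115 | 0 | 155 | 127 |
| Sex | | | | | | | | | | |
| Male | 48.7 | 48.8 (47.7, 49.8) | 51.1 (50.2, 51.9) | 49.4 | 49.9 | 48.5 | 48.5 (47.4, 49.6) | 50.7 (49.9, 51.6) | 49.3 | 49.9 |
| Female | 51.3 | 51.2 (50.2, 52.3) | 48.9 (48.1, 49.8) | 50.6 | 50.1 | 51.5 | 51.5 (50.4, 52.6) | 49.3 (48.4, 50.1) | 50.7 | 50.1 |
| Missing (n) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Ethnicity | | | | | | | | | | |
| White | 98.0 | 98.1 (97.8, 98.3) | 96.8 (96.5, 97.1) | 90.3 | 96.9 | 97.9 | 97.9 (97.6, 98.2) | 96.8 (96.4, 97.1) | 88.3 | 95.8 |
| Non-white | 2.0 | 1.9 (1.7, 2.2) | 3.2 (2.9, 3.5) | 9.7 | 3.1 | 2.1 | 2.1 (1.8, 2.4) | 3.2 (2.9, 3.6) | 11.7 | 4.2 |
| Missing (n) | 0 | 10 | 0 | 113 | 113 | 0 | 7 | 0 | 116 | 95 |
| Region | | | | | | | | | | |
| South | 47.9 | 47.9 (46.9, 49.0) | 46.3 (45.3, 47.2) | 49.4 | 47.2 | 46.0 | 48.1 (47.0, 49.2) | 45.8 (44.9, 46.8) | 50.1 | 47.4 |
| North | 46.1 | 46.1 (45.1, 47.2) | 47.0 (46.0, 47.9) | 45.3 | 47.0 | 48.1 | 46.0 (44.9, 47.1) | 47.8 (46.8, 48.8) | 44.6 | 46.7 |
| Wales | 6.0 | 6.0 (5.5, 6.5) | 6.8 (6.2, 7.3) | 5.3 | 5.8 | 6.0 | 6.0 (5.5, 6.5) | 6.4 (5.9, 6.9) | 5.3 | 5.9 |
| Missing (n) | 3 | 3 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 |
| Employment status | | | | | | | | | | |
| Full-time | 69.0 | 69.0 (68.0, 70.0) | 66.3 (65.3, 67.3) | 61.1 | 62.4 | 61.2 | 61.2 (60.2, 62.3) | 57.8 (56.5, 59.0) | 55.2 | 56.0 |
| Part-time | 18.4 | 18.3 (17.5, 19.2) | 17.0 (16.2, 17.8) | 17.7 | 18.0 | 20.2 | 20.2 (19.3, 21.1) | 18.8 (17.9, 19.7) | 19.0 | 19.5 |
| Unemployed | 1.7 | 1.6 (1.4, 1.9) | 2.4 (2.0, 2.8) | 3.2 | 3.1 | 2.9 | 2.9 (2.5, 3.2) | 4.0 (3.5, 4.6) | 4.3 | 4.3 |
| Long-term sick/disabled | 4.0 | 4.0 (3.6, 4.4) | 6.1 (5.5, 6.6) | 6.3 | 6.4 | 5.2 | 5.2 (4.7, 5.7) | 7.6 (6.9, 8.3) | 9.2 | 9.1 |
| Looking after home/family | 5.4 | 5.4 (4.9, 5.9) | 6.0 (5.5, 6.5) | 7.3 | 6.2 | 6.2 | 6.2 (5.7, 6.8) | 7.2 (6.5, 7.8) | 5.1 | 4.5 |
| Other (e) | 1.7 | 1.7 (1.4, 2.0) | 2.1 (1.7, 2.5) | 4.4 | 4.0 | 4.3 | 4.3 (3.9, 4.8) | 4.6 (4.1, 5.2) | 7.1 | 6.6 |
| Missing (n) | 0 | 0 | 0 | 32120120 [values for this and the next three columns run together in source] | | | | 0 | 151 | 126 |
| Social class NS-SEC | | | | | | | | | | |
| Professional/higher management | 41.9 | 41.9 (40.8, 42.9) | (g) | 33.9 | 34.8 | 35.7 | 35.7 (34.7, 36.8) | (g) | 29.2 | 30.0 |
| Intermediate | 19.8 | 19.8 (19.0, 20.7) | (g) | 18.4 | 18.6 | 23.1 | 23.1 (22.2, 24.0) | (g) | 20.1 | 20.7 |
| Routine and manual | 25.7 | 25.7 (24.7, 26.6) | (g) | 28.0 | 28.3 | 20.9 | 20.9 (20.0, 21.8) | (g) | 25.1 | 25.0 |
| Other (f) | 12.7 | 12.7 (12.0, 13.4) | (g) | 19.7 | 18.3 | 20.3 | 20.3 (19.4, 21.2) | (g) | 25.6 | 24.3 |
| Missing (n) | 27 | 27 | (g) | 32120187 [values for this and the next three columns run together in source] | | | | (g) | 0 | 0 |
| Marital status | | | | | | | | | | |
| Married | 71.1 | 71.1 (70.1, 72.0) | 67.3 (66.3, 68.3) | 68.7 | 67.8 | 71.5 | 71.5 (70.5, 72.5) | 65.6 (64.4, 66.8) | 70.0 | 69.3 |
| Divorced/separated/widowed | 17.5 | 17.5 (16.7, 18.3) | 19.8 (18.9, 20.7) | 18.7 | 19.1 | 18.6 | 18.6 (17.8, 19.5) | 21.9 (20.8, 22.9) | 19.4 | 19.5 |
| Single | 11.5 | 11.5 (10.8, 12.2) | 12.9 (12.2, 13.7) | 12.7 | 13.1 | 9.9 | 9.9 (9.2, 10.5) | 12.6 (11.8, 13.3) | 10.6 | 11.2 |
| Missing (n) | 18 | 18 | 0 | 17 | 14 | 5 | 5 | 0 | 55 | 43 |
| Living arrangements | | | | | | | | | | |
| No partner | | | | 22.9 | 22.9 | 21.0 | 21.0 (20.1, 21.9) | 25.8 (24.7, 27.0) | 26.9 | 26.2 |
| Spouse | | | | 68.1 | 67.7 | 69.1 | 69.1 (68.1, 70.1) | 62.6 (61.4, 63.9) | 64.9 | 65.1 |
| Co-habiting | | | | 9.0 | 9.4 | 10.0 | 10.0 (9.3, 10.6) | 11.5 (10.8, 12.3) | 8.2 | 8.7 |
| Missing (n) | | | | 47 | 41 | 0 | 0 | 0 | 49 | 36 |
| Housing tenure | | | | | | | | | | |
| Own – outright | 14.3 | 14.3 (13.6, 15.1) | 14.1 (13.4, 14.9) | 16.2 | 15.7 | | | | 34.0 | 35.3 |
| Own – mortgage | 71.5 | 71.5 (70.5, 72.4) | 66.8 (65.8, 67.8) | 63.5 | 65.1 | | | | 43.4 | 43.9 |
| Rent/other | 14.2 | 14.2 (13.5, 15.0) | 19.1 (18.2, 19.9) | 20.3 | 19.2 | | | | 22.6 | 20.8 |
| Missing (n) | 39 | 39 | 0 | 193 | 140 | | | | 101 | 76 |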

Notes:

(a) NCDS sample restricted to those resident in England and Wales.

(b) Multiple imputation. Imputation model includes analysis variables (with the exception of social class NS-SEC), predictors of non-response at sweep 7/9 and selected variables predictive of analysis variables.

(c) Including all LS respondents.

(d) Excluding LS respondents who arrived in the UK after age 16.

(e) Full-time education, government training scheme, retired, temporarily sick or disabled.

(f) Never worked, long-term unemployed, not working, unclassifiable.

(g) Social class NS-SEC not included in MI analysis due to collinearity with employment status.

Table 1 shows that, after accounting for loss to follow-up with MI that includes auxiliary information from the NCDS survey itself, most estimates from NCDS are closer to those from LS and do not show the discrepancy highlighted in the comparisons made by Archer et al. Results for the estimated prevalence of long-term limiting illness are shown in Figure 1. We would not expect a perfect match, because other potential sources of variation between NCDS and LS were not accounted for by Archer et al (age and calendar period effects, missing data handling in LS, minor differences in the way some questions were asked, and potential mode effects). Even so, our results suggest that the methods described restored NCDS sample representativeness with respect to LS quite effectively.
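As a rough guide to the precision of the estimates in Table 1, the sketch below computes a simple Wald interval for the observed-data prevalence of long-term limiting illness at age 55 (19.7%). It is a worked illustration under stated assumptions, not a description of how the published intervals were produced: the interval method is assumed, and the 7,992 complete cases are taken to be the 8,107 NCDS respondents less an assumed 115 with a missing response.

```python
import math
from statistics import NormalDist


def wald_ci(p, n, level=0.95):
    """Wald (normal approximation) confidence interval for a proportion."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


# Assumed inputs: prevalence 19.7% among an assumed 7,992 complete cases.
low, high = wald_ci(0.197, 7992)
print(f"{100 * low:.1f}, {100 * high:.1f}")  # prints "18.8, 20.6"
```

Under these assumptions the interval agrees with the 19.7 (18.8, 20.6) shown for the observed NCDS data, so the width of that interval is about what simple random sampling would give at this sample size.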

Figure 1: Estimated prevalence of long-term limiting illness

After accounting for loss to follow-up with multiple imputation that includes auxiliary information from the National Child Development Study, the estimated prevalence of long-term limiting illness is similar to that from the ONS Longitudinal Study.

Citation: Longitudinal and Life Course Studies 13, 2; 10.1332/175795921X16428748347208

Notes:

LS1: Estimate from ONS LS data including all LS respondents (from Archer et al Table 1).

LS2: Estimate from ONS LS data excluding LS respondents who arrived in the UK after age 16 (from Archer et al Table S4).

Archer et al: Estimate using observed NCDS Sweep 9 data (from Archer et al Table 1).

Observed: Estimate using observed NCDS Sweep 9 data (our own calculation).

MI: Estimate using multiple imputation (our own calculation).

These corrections do not constitute a formal test for missing data generating mechanisms, and there could be other variables in NCDS for which we would not be able to replicate the known population distribution with these methods. However, in our published work (Mostafa et al, 2021), we show that we are also able to replicate the known population distribution of educational attainment and marital status at age 50 based on external benchmarks (the ONS Annual Population and Labour Force Surveys), as well as internal benchmarks: the original distribution of paternal social class observed at the birth survey and the distribution of cognitive ability at age 7.

While we have no doubt that adding information from population administrative data, whether in the creation of weights or within multiple imputation or full information maximum likelihood, could enhance these methods further, the extent of the benefit remains an open empirical question and is likely to be modest relative to the survey data corrections described earlier. Our work in progress, funded by the Economic and Social Research Council and Administrative Data Research UK (grant number ES/V006037/1), is augmenting these corrections with additional population administrative data from hospital and educational records, and will be published in due course.

By making no attempt in their analyses to use survey responses to correct for missing data due to non-response or loss to follow-up, Archer et al leave their findings open to a clear misinterpretation by readers: that nothing can be done to restore representativeness in NCDS or other longitudinal surveys if administrative data are not used. This is far from the truth. Using appropriate methods, estimates from NCDS are indeed ‘real-world’ relevant and can be used for policy inference. Further guidance on how users can adopt these methods for missing data handling in NCDS in their own analyses is available in the NCDS Missing Data User Guide, and we also offer a programme of regular user training.³

Notes

1

In this approach we view missing data analysis as an attempt to restore sample representativeness with respect to a well-defined target population. The target population of NCDS, as of any other longitudinal survey, is dynamic, because changes occur over time, for example due to mortality. Considering that the NCDS mortality rate is representative of the population (Mostafa et al, 2021), the target population in each sweep of NCDS needs to be adjusted to reflect these changes. In this instance the target population for our analyses is those born in Britain in 1958 who were alive at the time of data collection and still residing in Britain.

Missing values of the analysis variables were imputed using MI, with the exception of two variables: sex and ethnicity. We know sex (for all cohort members) and ethnicity (for virtually all cohort members) from previous sweeps. We therefore (singly) imputed these variables with their known values. We acknowledge that self-reported sex and ethnicity may vary over time within individuals, whereas this approach treats them as being fixed, but we would suggest that in ‘real-world’ analyses most analysts would be willing to make this assumption in order to handle missing data. After imputing these variables with their known values, sex is complete but ethnicity still has some missing values, which were handled using MI.
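The single imputation of sex and ethnicity from earlier sweeps can be pictured with the short sketch below. It is a hypothetical illustration rather than the procedure actually coded for these analyses: the function name `known_value`, the example records and the choice to take the most recent earlier report when more than one exists are all assumptions made for the example.

```python
from typing import Dict, List, Optional


def known_value(reports: List[Optional[str]]) -> Optional[str]:
    """Return the most recent non-missing report, or None if never reported.

    reports: one entry per earlier sweep, ordered from earliest to latest,
    with None marking a sweep where the item is missing.
    """
    for value in reversed(reports):
        if value is not None:
            return value
    return None


# Hypothetical cohort members with reports from three earlier sweeps.
earlier_reports: Dict[str, List[Optional[str]]] = {
    "member_1": ["white", None, "white"],
    "member_2": [None, "non-white", None],
    "member_3": [None, None, None],  # never reported: left for MI to handle
}
filled = {member: known_value(r) for member, r in earlier_reports.items()}
print(filled)
```

Members for whom no earlier report exists remain missing after this step, which corresponds to the residual missing values of ethnicity that were then handled using MI.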

2

Analyses of age 46 outcomes included 23 predictors of non-response at age 46 (as identified in Mostafa et al, 2021) and 11 variables considered predictive of underlying missing values: region at ages 0, 23 and 42, marital status at ages 23, 33 and 42, housing tenure at ages 23, 33 and 43, and employment status at ages 33 and 42. Analyses of age 55 outcomes included 30 predictors of non-response at age 55 (as identified in Mostafa et al, 2021) and 12 variables considered predictive of underlying missing values: region at ages 0 and 23, long-term limiting illness at ages 33 and 42, employment status at ages 33 and 50, marital status at ages 33, 42, 46 and 50, and living arrangements at ages 46 and 50.

Conflict of interest

The authors declare that there is no conflict of interest.

References

  • Archer, G., Xun, W.W., Stuchbury, R., Nicholas, O. and Shelton, N. (2020) Are ‘healthy cohorts’ real-world relevant? Comparing the National Child Development Study (NCDS) with the ONS Longitudinal Study (LS), Longitudinal and Life Course Studies, 11(3): 307–30. doi: 10.1332/175795920X15786630201754

  • Azur, M.J., Stuart, E.A., Frangakis, C. and Leaf, P.J. (2011) Multiple imputation by chained equations: what is it and how does it work?, International Journal of Methods in Psychiatric Research, 20(1): 40–9. doi: 10.1002/mpr.329

  • Carpenter, J. and Kenward, M. (2012) Multiple Imputation and Its Application, Chichester: Wiley.

  • Harel, O., Mitchell, E.M., Perkins, N.J., Cole, S.R., Tchetgen Tchetgen, E.J., Sun, B. and Schisterman, E.F. (2018) Multiple imputation for incomplete data in epidemiologic studies, American Journal of Epidemiology, 187(3): 576–84. doi: 10.1093/aje/kwx349

  • Mostafa, T., Narayanan, M., Pongiglione, B., Dodgeon, B., Goodman, A., Silverwood, R.J. and Ploubidis, G.B. (2021) Missing at random assumption made more plausible: evidence from the 1958 British birth cohort, Journal of Clinical Epidemiology, 136: 44–54. doi: 10.1016/j.jclinepi.2021.02.019

  • White, I.R., Royston, P. and Wood, A.M. (2011) Multiple imputation using chained equations: Issues and guidance for practice, Statistics in Medicine, 30(4): 377–99. doi: 10.1002/sim.4067
