A pruned hotdeck imputation approach to estimation from linked survey and administrative data
Location: ABS Office, Canberra or Adelaide preferred
Duration: 4 months
Proposed start date: April 2019
Regression Data Integration (RDI) is a new method for combining probability sample and non-probability Big Data sources to produce more efficient statistics. The method involves identifying the units on the probability sample that are also on the Big Data set, and then partially calibrating the probability sample to a set of Big Data benchmarks, including the data item(s) of interest.
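As a rough illustration of the partial calibration idea, the sketch below adjusts the design weights of only those sample units that are matched to the Big Data set, so that the weighted count and the weighted total of one data item agree with Big Data benchmarks. It uses a standard linear calibration adjustment; the function name, argument names and the choice of two benchmarks are assumptions made for illustration, not the RDI method as specified by its authors.

```python
def calibrate_to_big_data(y, in_big, d, n_big, y_total_big):
    """Linear calibration of design weights for units matched to Big Data.

    Only matched units (in_big[i] is True) have their weights adjusted, so that
      sum of w_i over matched units       == n_big
      sum of w_i * y_i over matched units == y_total_big
    Unmatched units keep their design weights (hence "partial" calibration).
    All names here are hypothetical, chosen for this sketch.
    """
    idx = [i for i, matched in enumerate(in_big) if matched]
    # Design-weighted moments of x_i = (1, y_i) over the matched units.
    s0 = sum(d[i] for i in idx)
    s1 = sum(d[i] * y[i] for i in idx)
    s2 = sum(d[i] * y[i] ** 2 for i in idx)
    det = s0 * s2 - s1 * s1
    if det == 0:
        raise ValueError("benchmarks are collinear on the matched units")
    # Solve the 2x2 system for the calibration multipliers.
    r0, r1 = n_big - s0, y_total_big - s1
    lam0 = (s2 * r0 - s1 * r1) / det
    lam1 = (s0 * r1 - s1 * r0) / det
    w = list(d)
    for i in idx:
        # Adjusted weight: w_i = d_i * (1 + lam0 + lam1 * y_i).
        w[i] = d[i] * (1.0 + lam0 + lam1 * y[i])
    return w


# Toy data: five sample units, the first three matched to the Big Data set.
y = [10.0, 20.0, 30.0, 40.0, 50.0]
in_big = [True, True, True, False, False]
d = [2.0] * 5
w = calibrate_to_big_data(y, in_big, d, n_big=7.0, y_total_big=150.0)
```

Each additional data item adds another benchmark and another multiplier to solve for, so the linear system grows with the number of items being calibrated.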
When there are only a few key data items of interest, the number of benchmarks required will be small and the calibration exercise will be effective. When the number of key data items is large (as is the case in multipurpose surveys), the calibration will become less efficient, or may not converge at all. We could calibrate each key data item separately to get around this, but a separate set of weights would then need to be created for each item. Calibrating each item separately may also distort the relationships between data items for a unit, in some cases invalidating those relationships.
A mass imputation approach has been proposed as a follow-up step to the application of the RDI method to get around these issues. This project will develop a “pruned hotdeck” approach to this imputation. This approach has two steps:
- A multiple hotdeck step that imputes multiple sets of realistic values, each set copied from a random unit from the target unit’s “imputation group”.
- A pruning step that iterates through the units and at each iteration removes an imputed set of values so as to improve a measure of overall discrepancy from the set of required calibration totals. This proceeds until there is a single imputed set of values for each target unit, while the overall imputed dataset achieves a high level of consistency with the required totals.
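The two steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method to be developed in the project: it assumes each remaining candidate for a unit shares that unit's weight equally, measures discrepancy as the sum of squared relative differences from the required totals (which must be non-zero), and at each visit drops whichever candidate leaves the smallest discrepancy. All function and variable names are hypothetical.

```python
import random


def multiple_hotdeck(target_groups, donors_by_group, m, rng):
    """Step 1: give each target unit m candidate value sets, each copied
    from a randomly chosen donor in the unit's imputation group."""
    return [[rng.choice(donors_by_group[g]) for _ in range(m)]
            for g in target_groups]


def totals(candidates, weights, n_items):
    """Imputed totals, sharing each unit's weight equally across its
    remaining candidates (an assumed fractional-weight convention)."""
    tot = [0.0] * n_items
    for cands, w in zip(candidates, weights):
        for c in cands:
            for j in range(n_items):
                tot[j] += w * c[j] / len(cands)
    return tot


def discrepancy(tot, required):
    """Assumed measure: sum of squared relative differences."""
    return sum(((t - r) / r) ** 2 for t, r in zip(tot, required))


def prune(candidates, weights, required):
    """Step 2: visit units in turn and drop the candidate whose removal
    gives the smallest discrepancy, until one candidate remains per unit."""
    candidates = [list(c) for c in candidates]
    n_items = len(required)
    while any(len(c) > 1 for c in candidates):
        for i, cands in enumerate(candidates):
            if len(cands) == 1:
                continue
            best_k, best_d = None, None
            for k in range(len(cands)):
                # Trial removal of candidate k for unit i.
                candidates[i] = cands[:k] + cands[k + 1:]
                d = discrepancy(totals(candidates, weights, n_items), required)
                if best_d is None or d < best_d:
                    best_k, best_d = k, d
            candidates[i] = cands[:best_k] + cands[best_k + 1:]
    return [c[0] for c in candidates]


# Toy example: two imputation groups, two data items, four target units.
rng = random.Random(0)
donors_by_group = {"A": [(1.0, 0.0), (0.0, 1.0)],
                   "B": [(2.0, 1.0), (1.0, 2.0)]}
target_groups = ["A", "A", "B", "B"]
cands = multiple_hotdeck(target_groups, donors_by_group, m=3, rng=rng)
imputed = prune(cands, [1.0] * 4, required=[4.0, 4.0])
```

The totals are recomputed from scratch at every trial removal for clarity; an implementation intended for large files would update them incrementally. Because every imputed set is copied whole from a single donor, relationships between data items within a unit are those of a real respondent.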
This project would build on methods developed in the ABS as published in 1352.0.55.092 – Research Paper: Imputation and Estimation for a Thematic Form Census (Methodology Advisory Committee), November 2007.
Research to be Conducted
The objective of the research will be to refine the pruned hotdeck imputation approach, and examine its properties empirically using synthetic and real data. Tasks will include:
- Refining the imputation approach – through literature search as well as theory development;
- Generating code to implement the method;
- Running simulations to assess empirically the effectiveness of the method in various scenarios.
We are looking for someone with the following skillset:
- Sound knowledge of sample survey methods, in particular: techniques for treating missing survey data (imputation), calibration methods, and replicate variance estimation.
- Proficiency in using SAS or R statistical software.
The key outcome for the organisation will be to gain a greater understanding of the potential benefits that the mass imputation approach (using pruned hotdeck imputation) has for achieving cost savings in surveys (measured by sample size savings), while preserving the quality of broad-level estimates and facilitating production of small domain estimates. It is expected that several outputs will be produced from this work: a feasibility report, informal presentations to interested ABS stakeholders, contribution to a conference presentation about the topic, and a prototype implementation of the method in SAS or R.
The intern will receive $3,000 per month of the internship, usually in the form of stipend payments.
It is expected that the intern will primarily undertake this research project during regular business hours, spending at least 80% of their time on-site with the industry partner. The intern will be expected to maintain contact with their academic mentor throughout the internship through either face-to-face or phone meetings, as appropriate.
The intern and their academic mentor will have the opportunity to negotiate the project’s scope, milestones and timeline during the project planning stage.
03 April 2019
APR – 0891