Big Data Methods for Multipurpose Surveys

Location: ABS Office, preferably Canberra ACT, alternatively Brisbane QLD

Duration: 4 months

Proposed start date: April 2019

Project Background

Regression Data Integration (RDI) is a new method for combining a probability sample with non-probability Big Data sources to produce more efficient statistics.  The method involves identifying the units in the probability sample that are also in the Big Data set, and then partially calibrating the probability sample to a set of Big Data benchmarks, including the data item(s) of interest.
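The calibration step can be illustrated with a minimal sketch.  It uses ordinary linear (chi-square distance) calibration and assumes, purely for illustration, that the Big Data benchmarks are the number of units in the Big Data set and the Big Data total of one data item of interest.  The function name, the toy data and the benchmark values are all hypothetical, and this is not presented as the RDI method's actual specification.

```python
import numpy as np

def linear_calibrate(d, X, totals):
    """Adjust design weights d so the weighted column totals of X equal `totals`.

    Generic linear calibration: w_i = d_i * (1 + x_i' lam), where lam solves
    (sum_i d_i x_i x_i') lam = totals - sum_i d_i x_i.
    """
    d = np.asarray(d, dtype=float)
    X = np.asarray(X, dtype=float)
    totals = np.asarray(totals, dtype=float)
    gap = totals - X.T @ d              # shortfall of design-weighted totals
    M = (X * d[:, None]).T @ X          # sum_i d_i x_i x_i'
    lam = np.linalg.solve(M, gap)
    return d * (1.0 + X @ lam)

# Hypothetical toy sample: 200 units, each representing 50 population units.
rng = np.random.default_rng(0)
n = 200
d = np.full(n, 50.0)
in_big = rng.random(n) < 0.4            # sample unit also found in the Big Data set
y = rng.normal(100.0, 15.0, n)          # data item of interest

# Auxiliary variables are zero for units outside the Big Data set, so only
# the overlapping units have their weights adjusted (assumed reading of
# "partially calibrating").
X = np.column_stack([in_big.astype(float), in_big * y])

# Hypothetical Big Data benchmarks: unit count and total of y.
totals = np.array([4200.0, 4200.0 * 101.0])

w = linear_calibrate(d, X, totals)
```

In this sketch the calibrated weights reproduce both benchmarks exactly, while units that are not in the Big Data set keep their design weights.  Linear calibration can produce negative weights; bounded or raking variants are common alternatives.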

When there are only a few key data items of interest, the number of benchmarks required will be small and the calibration exercise will be effective.  When the number of key data items is large (as is the case in multipurpose surveys), the calibration will become less efficient, or may not converge at all.  We could calibrate each key data item separately to get around this, but then a separate set of weights would need to be created for each item.  Calibrating each item separately may also affect the relationships between data items for a unit, in some cases invalidating those relationships.

A mass imputation approach has been proposed as a follow-up step to the application of the RDI method to get around these issues.  The approach involves finding nearest neighbour donor records (within hot deck classes) for units about which we have no information (i.e. units in neither the Big Data set nor the probability sample), based on values predicted for them via a model.  The nearest neighbour approach has the advantage that it retains relationships between data items at the unit level.  Additionally, it will result in a population-level dataset (removing the need for weights) and will facilitate the creation of small domain estimates.
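The donor search described above can be sketched as follows.  This is a simplified illustration only: it assumes a single model-predicted value per unit, absolute distance as the nearness measure, and hypothetical names and toy values throughout.  Refining the approach is the subject of the project itself.

```python
import numpy as np

def nn_hot_deck(donor_class, donor_pred, recip_class, recip_pred):
    """For each recipient, return the index of the donor in the same hot deck
    class whose predicted value is closest to the recipient's predicted value.
    Recipients whose class contains no donor are marked with -1."""
    donor_class = np.asarray(donor_class)
    donor_pred = np.asarray(donor_pred, dtype=float)
    recip_class = np.asarray(recip_class)
    recip_pred = np.asarray(recip_pred, dtype=float)

    chosen = np.full(len(recip_pred), -1)
    for c in np.unique(recip_class):
        d_idx = np.flatnonzero(donor_class == c)
        r_idx = np.flatnonzero(recip_class == c)
        if d_idx.size == 0:
            continue                    # no donor available in this class
        # Distance between every recipient and every donor in the class.
        dist = np.abs(recip_pred[r_idx][:, None] - donor_pred[d_idx][None, :])
        chosen[r_idx] = d_idx[dist.argmin(axis=1)]
    return chosen

# Hypothetical donors: units with observed records and model predictions.
donor_class = np.array(["A", "A", "B", "B"])
donor_pred = np.array([10.0, 20.0, 10.0, 30.0])
donor_records = np.array([[11.0, 1.0],  # each row is one donor's full record
                          [19.0, 0.0],
                          [9.0, 1.0],
                          [33.0, 0.0]])

# Hypothetical recipients: units with a model prediction but no observed data.
recip_class = np.array(["A", "B", "A"])
recip_pred = np.array([18.0, 12.0, 11.0])

donor_of = nn_hot_deck(donor_class, donor_pred, recip_class, recip_pred)
imputed = donor_records[donor_of]       # whole donor record copied to each recipient
```

Because each recipient receives a donor's complete record rather than item-by-item values, relationships between data items within a record are carried over unchanged.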

Research to be Conducted

The objective of the research will be to refine the nearest neighbour/hot deck imputation approach, and examine its properties empirically using synthetic and real data.  Tasks will include:

  • Refining the imputation approach – through literature search as well as theory development
  • Running simulations to assess empirically the effectiveness of the method in various scenarios
  • Formulating variance and bias properties for the method

Skills Required

We are looking for someone with the following skillset:

  • Sound knowledge of sample survey methods, in particular: techniques for treating missing survey data (imputation), calibration methods, and replicate variance estimation.
  • Proficiency in using SAS or R statistical software.

Expected Outcomes

The key outcome for the organisation will be to gain a greater understanding of the potential benefits that the mass imputation approach (using nearest neighbour imputation) has for achieving cost savings in surveys (measured by sample size savings), while preserving the quality of broad-level estimates and facilitating production of small domain estimates.  It is expected that several outputs will be produced from this work: a feasibility report, informal presentations to interested ABS stakeholders, contribution to a conference presentation about the topic, and a prototype implementation of the method in SAS.

Additional Details

The intern will receive $3,000 per month of the internship, usually in the form of stipend payments.

It is expected that the intern will primarily undertake this research project during regular business hours, spending at least 80% of their time on-site with the industry partner.  The intern will be expected to maintain contact with their academic mentor throughout the internship either through face-to-face or phone meetings as appropriate.

The intern and their academic mentor will have the opportunity to negotiate the project’s scope, milestones and timeline during the project planning stage.

Applications Close

03 April 2019

Reference

APR – 0888