Data editing
Postal data – Forced edits
The postal data are subject to errors introduced by respondents, as well as errors resulting from scanning or manual entry. Many of these errors can be dealt with through standard edit rules.
For example, if a single-code question has more than one category ticked, it is set to ‘missing – incorrectly multi-coded’. If a respondent answers a routed question which should have been skipped, it can be set to ‘not applicable’ and the original answer overwritten.
If a respondent says there are no adults in the household (including themselves), the answer can be set to 1, on the assumption that they excluded themselves (although if someone wrote 1 when they should have answered 2, this is of course harder to identify).
Many respondents ticked all the qualifications they held, rather than only the highest one – a forced edit was used to retain the highest qualification.
For the sport and activity data, if someone did not tick that they had done an activity in the last year but then provided data on their participation in the last 28 days, the data were edited to record that they had done the activity in the last year.
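The report does not specify how these rules are implemented in practice. The sketch below illustrates, in Python, how forced edits of this kind might be applied while keeping a record of each change; all field names, codes and rule details are illustrative assumptions rather than the survey's actual specification.

```python
# Illustrative sketch of forced edits; all field names and rule details
# are hypothetical, not the survey's actual specification.

MISSING_MULTICODED = -97   # 'missing - incorrectly multi-coded'

def apply_forced_edits(record, edit_log):
    """Apply forced-edit rules to one respondent record in place."""
    # Single-code question with more than one category ticked.
    if len(record.get("marital_status_ticks", [])) > 1:
        record["marital_status"] = MISSING_MULTICODED
        edit_log.append((record["serial"], "marital_status", "multi-coded"))

    # Zero adults reported in the household: assume the respondent
    # excluded themselves and set the answer to 1.
    if record.get("num_adults") == 0:
        record["num_adults"] = 1
        edit_log.append((record["serial"], "num_adults", "0 -> 1"))

    # All qualifications ticked rather than only the highest: retain
    # the highest (assuming here that a lower code means a higher level).
    quals = record.get("qualification_ticks", [])
    if len(quals) > 1:
        record["highest_qual"] = min(quals)
        edit_log.append((record["serial"], "highest_qual", "kept highest"))

    # 28-day participation given without the 'last year' tick:
    # record that the activity was done in the last year.
    for activity in record.get("activities", {}).values():
        if activity.get("days_28") and not activity.get("last_year"):
            activity["last_year"] = True
            edit_log.append((record["serial"], "last_year", "back-filled"))

edit_log = []
respondent = {"serial": 101, "num_adults": 0,
              "qualification_ticks": [2, 4, 5],
              "activities": {"swimming": {"days_28": 3, "last_year": False}}}
apply_forced_edits(respondent, edit_log)
print(edit_log)   # the full record of forced edits is retained
```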
A full record is kept of the forced edits applied to the data. These edits improve the quality of the data and make them more consistent and easier to analyse.
Postal data – Manual edits
Examining the early data from the paper questionnaire and comparing it with the scanned data made it apparent that there were certain common errors which would not be corrected by forced edits, or for which a forced edit might lead to unnecessary missing data.
For example, respondents sometimes start ticking in one row of the activity grid but then mistakenly move up or down a line further along the row.
This means a respondent may tick that they have walked for travel in the last year but give no further data on that row, while the walking for leisure row has information on participation in the last month but no tick for the last year.
It was common for people to tick 'no' to the question about disability and then change their answer to 'yes'.
A forced edit would treat this as missing because multiple answers were given, but on viewing the questionnaire it was possible to see that the ‘no’ answer had been crossed out, even though both marks were picked up by the scanner.
Therefore, alongside the forced edits, a series of manual edits was specified for errors which could be corrected more reliably by viewing the questionnaire.
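Although these edits are made by coders viewing the questionnaire, candidate cases of this kind could in principle be flagged programmatically for review. The sketch below is purely illustrative; the field names and flagging rules are assumptions, not the survey's actual procedure.

```python
# Hypothetical sketch: flag cases for manual review against the scanned
# image. The survey's actual selection of cases is not specified here.

def flag_for_review(record):
    """Return reasons this record should be checked against the image."""
    reasons = []

    # Possible row shift on the activity grid: a row with the last-year
    # tick but no 28-day data, next to a row with the opposite pattern.
    acts = record.get("activity_rows", [])
    for above, below in zip(acts, acts[1:]):
        if (above["last_year"] and not above["days_28"]
                and below["days_28"] and not below["last_year"]):
            reasons.append("possible row shift: " + above["name"])

    # Both 'yes' and 'no' scanned on the disability question: one
    # answer may have been crossed out by the respondent.
    if record.get("disability_ticks") == {"yes", "no"}:
        reasons.append("disability multi-ticked; check for crossing out")

    return reasons

rec = {"activity_rows": [
           {"name": "walking for travel", "last_year": True, "days_28": 0},
           {"name": "walking for leisure", "last_year": False, "days_28": 5}],
       "disability_ticks": {"yes", "no"}}
print(flag_for_review(rec))
```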
Postal questionnaires were scanned using dedicated scanning software – FAQSS – which can process large volumes of paper responses quickly and accurately.
Once a questionnaire is scanned, FAQSS reads and interprets the responses, distinguishing actual answers from stray marks or smudges so that only genuine responses are captured.
Using Intelligent Character Recognition (ICR), FAQSS converts handwriting into digital text, so handwritten answers are captured automatically without needing to be manually typed in.
The system uses a built-in programming language that allows it to be adapted to the project’s needs, capturing different question formats.
To ensure data were captured accurately and errors were caught early, a data verification process was implemented.
Written instructions are provided to coders to explain what should be done and the kinds of problems to be rectified at each question. Options include:
- No change – data are as captured and any problems will be dealt with by the associated forced edit;
- Change because the captured data are not what was written or intended by the respondent (e.g. a multi-code had one answer crossed out, so the crossed-out answer can be removed; a 7 read as a 1; etc.);
- Change because the data are what was written but an obvious error was made which can be rectified (e.g. answers to the breathing questions ticked in the wrong row when it is clear which activity they should apply to, or where it is clear that ‘no’ to out of breath and sweaty was used to mean ‘I didn’t do the activity’).
Any complex cases which the coder has queries about are flagged and reviewed by a researcher. The decisions are then output into an error file, which is applied to the data so that the manually edited data are captured alongside any remaining forced edits.
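The layout of the error file is not described here. As a minimal sketch, assuming a simple file of serial number, variable name and corrected value, applying it to the captured data might look like this:

```python
import csv, io

# Hypothetical error-file layout (serial, variable, corrected_value);
# the survey's actual file format is not specified in the source.
ERROR_FILE = """serial,variable,corrected_value
101,disability,yes
"""

def apply_error_file(data, error_csv):
    """Overwrite captured values with the coders' manual corrections."""
    for row in csv.DictReader(io.StringIO(error_csv)):
        serial = int(row["serial"])
        data[serial][row["variable"]] = row["corrected_value"]

captured = {101: {"disability": "-97"}}   # scanner picked up both ticks
apply_error_file(captured, ERROR_FILE)
print(captured)   # {101: {'disability': 'yes'}}
```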
Online data
The online data need less editing because checks and edits are built into the questionnaire itself.
For example, extreme high values of time spent doing activities, or durations of less than 10 minutes, are checked.
Where multiple answers are selected on single-code questions, respondents are asked to correct their answers.
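A minimal sketch of how such in-questionnaire checks might behave is shown below. Only the 10-minute floor comes from the text above; the 600-minute ceiling and the prompt wording are assumed illustrations.

```python
# Minimal sketch of in-questionnaire checks; prompt wording is invented.
MAX_PLAUSIBLE_MINUTES = 600   # assumption, not the survey's actual limit

def check_duration(minutes):
    """Return a prompt asking the respondent to confirm or correct."""
    if minutes < 10:
        return "You entered less than 10 minutes. Please check your answer."
    if minutes > MAX_PLAUSIBLE_MINUTES:
        return "That is an unusually long time. Please check your answer."
    return None   # answer accepted as entered

def check_single_code(ticks):
    """Single-code questions allow exactly one selected answer."""
    return "Please select only one answer." if len(ticks) > 1 else None

print(check_duration(5))
print(check_single_code({"yes", "no"}))
```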
After the data were received in the office, rules were set for defining missing values and a small number of further edits were possible.
Missing values
In the survey data, there are various reasons why a question may not have been answered.
In the postal data, a question which should have been answered may have been missed by respondents.
On the online survey, to allow respondents to proceed past questions they may not know the answer to or do not wish to answer, codes are provided which allow them to say ‘don’t know’ or ‘prefer not to say’.
There are also questions which are not applicable because survey routing meant they were not asked of some respondents.
Missing values and codes used
Code used   Description
-99         Missing, should have been answered
-98         Not applicable: Survey routing
-97         Incorrectly multi-coded
-96         Out of range
-95         Don't know/Cannot give estimate
-94         Prefer not to say

Wherever possible, the base for questions has been set to all participants. However, for questions not asked at all for one group, missing values must be used.
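For analysis, these codes need to be distinguished from substantive answers. The sketch below illustrates one way of recoding them; it is not the survey's published syntax, and the example values are hypothetical.

```python
# Illustrative sketch: recode the survey's negative missing-value codes
# before analysis. The example answers below are hypothetical.
MISSING_CODES = {
    -99: "Missing, should have been answered",
    -98: "Not applicable: Survey routing",
    -97: "Incorrectly multi-coded",
    -96: "Out of range",
    -95: "Don't know/Cannot give estimate",
    -94: "Prefer not to say",
}

def substantive(value):
    """Return the value, or None for any of the missing-value codes."""
    return None if value in MISSING_CODES else value

answers = [3, -95, 12, -98]
print([substantive(v) for v in answers])   # [3, None, 12, None]
```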
For the main activity measures (e.g. whether participated at least twice in the last 28 days, or at least 150 minutes of moderate-intensity activity in the last 28 days), the base is all participants.
If there are missing data on one of the activities, this is just treated as not having done the activity.
This is because so many different activities are asked about, and so many different variables feed in (number of sessions, minutes, and two intensity questions), that excluding anyone with missing data on one or more of these variables would leave a huge number of respondents for whom these key measures could not be calculated.
Furthermore, the questionnaire was designed such that the absence of a tick for having done the activity in the last year is treated as not having done the activity, so there are no missing data on whether the activity was done in the last year.
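The sketch below illustrates the implication for deriving the key measures: missing participation data simply contribute zero minutes. The field names and the way the 150-minute threshold is applied here are illustrative assumptions, not the survey's actual derivation.

```python
# Sketch: derive total moderate-intensity minutes over the last 28 days,
# treating any missing activity data as not having done the activity.
# Field names and the threshold application are illustrative only.

def activity_minutes(activity):
    """Minutes contributed by one activity; missing data count as zero."""
    sessions = activity.get("sessions_28") or 0
    minutes = activity.get("minutes_per_session") or 0
    moderate = activity.get("raised_breathing", False)  # intensity question
    return sessions * minutes if moderate else 0

def meets_150_minutes(activities):
    return sum(activity_minutes(a) for a in activities) >= 150

person = [
    {"sessions_28": 8, "minutes_per_session": 30, "raised_breathing": True},
    {"sessions_28": None, "minutes_per_session": None},  # missing -> 0
]
print(meets_150_minutes(person))   # True: 240 minutes from the first activity
```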