Early analysis of the data from the first few months of the survey highlighted some common errors made by participants. These included:
- Putting time spent on activities for the whole month rather than a session;
- Providing a figure greater than 28 days for number of days participated in an activity in the last four weeks;
- Putting minutes in the hours box (e.g. 20 hours instead of 20 minutes);
- Not answering all the questions needed to create some of the key participation variables (e.g. providing days and missing hours and minutes);
- Double counting activities, especially in the gym (e.g. recording that they did a combined session as well as recording each individual activity within it);
- Recording the same activity both under “other” and under the provided code. This may be correct, but when back-coded data are merged into the main code for the activity, it can generate a high number of days in a month spent on that one activity, even in online data;
- Recording a height and weight combination which results in an improbable BMI.
These errors were handled at various points in data processing.
Some of these errors were corrected in initial edits (e.g. answers of >23 hours per session are not permitted, and in manual checking some obvious mistakes can be corrected).
However, some of these errors cannot be rectified by manual checks of the raw data, either because they only become apparent when derived variables are created or because there is no clear way to correct them.
At the same time, because the activities build across composite variables, it was necessary to take action to manage extreme values.
It was also important to compensate for missing data on key variables: excluding every case with missing or extreme data on any one of the components feeding into the key sports participation variables would have left too much missing data.
Nonetheless, some missing data could not be compensated for (e.g. missing days): where days are missing, it was assumed that the time spent doing the activity in the month was 0.
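The treatment of missing days can be illustrated with a minimal sketch. It assumes that monthly time on an activity is derived as days multiplied by the per-day session length; the function name `monthly_minutes` and its parameters are hypothetical and are not taken from the survey's processing code.

```python
def monthly_minutes(days, session_minutes):
    """Minutes spent on one activity in the last 28 days.

    Assumption: monthly time = number of days x session length per day.
    `days` is None when the respondent left the days question blank.
    """
    if days is None:
        # Missing days cannot be compensated for, so the time spent
        # on the activity in the month is taken to be 0.
        return 0
    return days * session_minutes
```

For example, `monthly_minutes(8, 45)` gives 360 minutes, while `monthly_minutes(None, 45)` gives 0 whatever session length was reported.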
After a period of testing against early data and discussion between Sport England and Ipsos, a protocol was agreed for dealing with these issues:
- Where people provided information about the frequency with which they undertook an activity but did not provide sufficient information to calculate the duration of a session (time spent on an activity per day), the session value was imputed as the median session duration for that activity.
- Extreme session values were capped to remove outliers. The upper limit was set at the 95th percentile of session durations for each activity, except where there were insufficient cases to calculate the percentile; in those cases, the 95th percentile for a similar activity was used as the cap.
- To handle the same gym or fitness activities being entered twice (under a combined gym session, and under each individual activity done within the session), a series of flags was created which suggested duplication of gym activities:
  - If the individual session lengths add up to the length of the combined session;
  - If total days across all fitness activities sum to more than 28;
  - If the number of days equals the number of days for the combined session;
  - Where both combined sessions and individual activities are reported, and the individual activities are short (say less than 15 minutes).
Where an individual reported a combined gym session and also reported completing at least two individual activities and triggered at least two rules from the list above, then it was assumed that they had duplicated their responses.
In these cases, the data for individual sessions were removed when creating the derived variables, so that only the combined gym session remained. This applied to all measures based on data for the last 28 days (weekly participation, twice monthly, frequency, duration, MEM28 and MEM7) but not to participation in the last 12 months.
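The protocol above can be sketched roughly as follows. This is an illustration under stated assumptions, not the processing code agreed between Sport England and Ipsos. All function and variable names are hypothetical. The source does not state the threshold for "insufficient cases" (`MIN_CASES` is a placeholder), the percentile method, or whether the day and short-session rules must hold for every individual activity; here they are assumed to apply to all of them, and total days are assumed to include the combined session.

```python
from statistics import median, quantiles

MIN_CASES = 30           # assumed threshold for "insufficient cases" (not in the source)
SHORT_SESSION_MINS = 15  # "say less than 15 minutes"
MAX_DAYS = 28            # days in the four-week reference period


def session_cap(durations_by_activity, activity, similar_activity=None):
    """95th-percentile cap, borrowed from a similar activity if cases are too few."""
    values = durations_by_activity.get(activity, [])
    if len(values) < MIN_CASES and similar_activity is not None:
        values = durations_by_activity[similar_activity]
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    # Needs at least two values; the interpolation method is an assumption.
    return quantiles(values, n=100, method="inclusive")[94]


def clean_session(minutes, reported_durations, cap):
    """Impute a missing session length with the activity median, then cap it."""
    if minutes is None:
        minutes = median(reported_durations)
    return min(minutes, cap)


def gym_duplicated(combined, individual):
    """True if individual gym activities look like a repeat of the combined session.

    combined: {"days": int, "minutes": number} or None if not reported
    individual: list of dicts with the same keys, one per individual activity
    """
    if combined is None or len(individual) < 2:
        return False
    rules = [
        # Individual session lengths add up to the combined session length.
        sum(a["minutes"] for a in individual) == combined["minutes"],
        # Total days across all fitness activities exceed 28.
        combined["days"] + sum(a["days"] for a in individual) > MAX_DAYS,
        # Days for the individual activities equal days for the combined session.
        all(a["days"] == combined["days"] for a in individual),
        # Individual activities are short.
        all(a["minutes"] < SHORT_SESSION_MINS for a in individual),
    ]
    # At least two rules must be triggered to assume duplication.
    return sum(rules) >= 2


def sessions_for_derived_variables(combined, individual):
    """Drop individual sessions when duplication is assumed (28-day measures only)."""
    if gym_duplicated(combined, individual):
        return [combined]
    return ([combined] if combined is not None else []) + list(individual)
```

As a usage example, a respondent reporting a 60-minute combined session on 12 days, plus two individual activities of 10 and 50 minutes on 12 days each, triggers the first three rules, so only the combined session would be kept for the 28-day measures.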