The great thing about online testing tools is that the results are not only compiled and partially analyzed for us, but are usually available instantly, even while the test is still “live”.

Very few people will wait until the test is done before peeking at the results. We generally take a look once we’ve hit 20 or 30 respondents, just to see how things are going.

Often, they’re going a lot worse than we expected. For example, we may have been hoping for a success rate of 60% to 70%, but the early results often come in at 40% or even less.

In one study, we ran three tree tests for an electricity company. We expected a low score for the existing site tree, but we also got low scores for the new trees we had designed:

	Existing tree	New tree 1	New tree 2
Uncorrected score	36%	40%	35%

While this was lower than we expected across the board, we told ourselves not to panic. From experience, we knew that the original data needs cleaning up before it accurately shows what happened.

There are a few common reasons for this:

Participants may have misinterpreted some of our tasks, leading to a lot of wrong answers. There’s not much we can do about this once the results start coming in, other than to take those tasks’ results with a grain of salt, and to fix those tasks in later tests.
Some participants may have given garbage answers, for a variety of reasons. We can remove some or all of these after the fact.
It’s likely that participants discovered some correct answers that we missed. We can fix these after the fact too.

Safeguarding the original data

The first thing to do, before we touch the data at all, is to back it up. We always want to have a copy of the original data in case anything goes wrong with the tool or our data clean-up.

For online tools, the most common way to preserve our data is to download it as a spreadsheet file (XLS, CSV, etc.).

Download the entire data set and save it in a secure place. Explicitly name it as the original data, then never touch that file again. It’s not for later analysis – it’s simply a backup of the raw data in case Murphy’s Law strikes and we need something to go back to.

This is also useful if the tool we’re using goes offline, either temporarily or (eventually) permanently. In the unlikely case that we need to go back and look at the raw numbers a year or two from now, we’ll want to have our own offline copy of the data.
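
As a rough illustration of the backup step, here is a minimal Python sketch. It assumes we have already downloaded the export to a local file. The function name `back_up_export`, the `_ORIGINAL_` naming scheme, and the folder layout are our own inventions for this example, not features of any particular testing tool.

```python
import shutil
import stat
from datetime import date
from pathlib import Path


def back_up_export(export_path, backup_dir):
    """Copy a downloaded export into backup_dir and mark the copy read-only."""
    src = Path(export_path)
    dest_dir = Path(backup_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    # Name the copy explicitly as the original data, stamped with today's date.
    dest = dest_dir / f"{src.stem}_ORIGINAL_{date.today():%Y%m%d}{src.suffix}"
    if dest.exists():
        raise FileExistsError(f"Refusing to overwrite existing backup: {dest}")

    shutil.copy2(src, dest)  # copy2 also tries to preserve file timestamps

    # Drop write permission so the backup is not edited by accident.
    dest.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return dest
```

Making the copy read-only is only a guard against accidents. The real safeguard is still the habit of never opening that file for analysis.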

Removing garbage sessions

Most online studies collect some garbage data.

This is pretty much a given in the land of online research. Most participants make an honest effort, but a few don’t.

This is especially true when the study is open to the public and there’s an incentive involved. We get more respondents, but some of them are just there for the prize, and will zoom through the study as quickly as possible to get to the pay-off.

Sometimes this means lower-quality data: they did the task in a rush, so their decisions were not as considered as they might be in real life. On the other hand, many people “rush” through their normal web browsing, so it’s hard to quantify this effect.
Sometimes this means garbage data: they clicked randomly, or chose the same option each time, just to get through the test quickly.

We can normally weed out the latter, and reduce some of the former.

Going too fast

The first thing we look at is sessions that were done too quickly. Most tools track the total time taken for each participant, and this becomes a good way to weed out garbage data.

There’s no hard-and-fast rule about what “too quickly” means, but if most respondents did our study in 8 minutes, and we find someone who did it in less than 2, that’s what we’re getting at.

After sifting through a lot of data, our general rule of thumb is: be skeptical of sessions that were done in less than half the average time.

So, in our example above, if the average session was 8 minutes, we would look at sessions that were done in less than 4 minutes.

So, what do we do with a suspect session?

If the session was done really fast (say, a quarter of the average time or less), we delete that session out of hand. It’s extremely unlikely that the participant could have performed the tasks with any thought at that speed.
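
If our tool exports the total time per session, the time-based triage can be sketched in a few lines of Python. This is only an illustration of the rule of thumb. The function name, the dictionary of durations, and the sample numbers are made up for the example, and the two ratios are parameters so we can use whatever thresholds we are comfortable with.

```python
from statistics import mean


def triage_by_time(durations, review_ratio=0.5, delete_ratio=0.25):
    """Sort sessions into 'delete', 'review', and 'keep' by completion time.

    durations maps a session ID to its total time in seconds.
    """
    average = mean(durations.values())
    result = {"delete": [], "review": [], "keep": []}
    for session_id, seconds in durations.items():
        if seconds <= average * delete_ratio:
            result["delete"].append(session_id)   # a quarter of the average or less
        elif seconds < average * review_ratio:
            result["review"].append(session_id)   # under half: check the click paths
        else:
            result["keep"].append(session_id)
    return result


# Made-up durations in seconds; the average works out to 380.
sample = {"a": 480, "b": 500, "c": 520, "d": 540, "e": 180, "f": 60}
print(triage_by_time(sample))
# {'delete': ['f'], 'review': ['e'], 'keep': ['a', 'b', 'c', 'd']}
```

Note that the speeders themselves pull the average down, so in a study with a lot of garbage sessions the thresholds become more lenient than we might intend.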

For the remaining sessions (those approaching “half time”, or whatever threshold we’re comfortable with), we review the data. This means looking at the click paths of each task for that participant, to see if there are any clear indications that they were intentionally speeding through our test. The most common indications are:

Choosing the same item at each level (often the first or last item)
Going down the same path for every task
Choosing nonsense paths for every task. We should be careful with this one, because what we consider a “nonsense path” might have made sense to them. Only suspect those who do this for a large number of tasks.
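
The first two indicators are mechanical enough that a script can flag them for us. The sketch below is one possible approach, assuming a data format we invented for the example: each task's click path is a list of `(index, sibling_count)` pairs, where `index` is the zero-based position of the item clicked at that level, and a skipped task is an empty list. Real exports will look different. Nonsense paths are deliberately left out, because judging them needs a human.

```python
def same_slot_every_level(path):
    """True if every click picked the same slot, or always the last item."""
    if len(path) < 2:
        return False  # a single click cannot show a pattern
    same_index = len({index for index, _count in path}) == 1
    always_last = all(index == count - 1 for index, count in path)
    return same_index or always_last


def speeding_indicators(paths):
    """Summarise speeding patterns across one participant's task paths."""
    answered = [p for p in paths if p]  # ignore skipped tasks (empty paths)
    if not answered:
        return {"same_slot_share": 0.0, "same_path_every_task": False}
    same_slot = sum(same_slot_every_level(p) for p in answered)
    same_path = len(answered) > 1 and all(p == answered[0] for p in answered)
    return {
        "same_slot_share": same_slot / len(answered),
        "same_path_every_task": same_path,
    }
```

A high `same_slot_share` or a `True` for `same_path_every_task` is a prompt to look at that session more closely, not a verdict on its own.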

If we find a session with a lot of this kind of garbage, we delete it, and if we are doing a prize draw for this study, we remove that participant from the draw. This is not a behavior to encourage.

Some tools let us “exclude” sessions from the analysis. This is what we think of as a “soft” deletion; the session is removed from the analysis, but it’s not actually deleted, so we can get it back later if we change our minds. If our tool offers exclusions, we recommend using them to clean up the data instead of actually deleting the data outright.

Skipping too many tasks

Another way that some participants speed through a test is by skipping tasks.

Again, we rely on a rule of thumb to expedite our clean-up: be skeptical of sessions where the participant skipped more than half the tasks.

If we find participants who skipped far more than that (say, 70% or more of the tasks), we delete those sessions without further review.

For the remaining few, we review their click paths (like we did above for those who went too fast), look for the same indicators, and delete/exclude those that look guilty.
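
The skip rule can be written down the same way. In this sketch, the function name and the list of per-task outcomes (with the string `"skipped"` marking a skipped task) are assumptions made for the example. The two thresholds default to the rule of thumb above.

```python
def triage_by_skips(outcomes, review_above=0.5, delete_from=0.7):
    """Return 'delete', 'review', or 'keep' for one participant.

    outcomes holds one entry per task; "skipped" marks a skipped task.
    """
    if not outcomes:
        raise ValueError("no tasks recorded for this participant")
    skipped_share = outcomes.count("skipped") / len(outcomes)
    if skipped_share >= delete_from:
        return "delete"   # skipped so much that review is not worth it
    if skipped_share > review_above:
        return "review"   # skipped more than half: check the click paths
    return "keep"
```

For a 10-task study, a participant who skipped 7 tasks would come back as `"delete"`, one who skipped 6 as `"review"`, and one who skipped exactly 5 as `"keep"`, because 5 is not more than half.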

Being wrong way too often

The final check we do is for participants who got every task wrong (or something close to that). This is a clue that they may not have made an honest effort.

The low-achiever rule of thumb is: be skeptical of participants who got 75% (or more) of the tasks wrong.

Again, we review their click paths as described above, look for the same indicators, and mete out our justice accordingly.
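
To build the review list for this check, we can rank participants by how often they were wrong. The sketch below assumes a made-up structure: a dictionary mapping each participant ID to a list of booleans, one per task they answered, with `True` meaning a correct answer. How skipped tasks should count is a judgment call. Here we assume they have been left out of the list.

```python
def low_achievers(results, threshold=0.75):
    """List (participant, wrong_share) pairs at or above threshold, worst first.

    results maps a participant ID to a list of booleans, one per answered
    task: True for a correct answer, False for a wrong one.
    """
    flagged = []
    for participant, answers in results.items():
        if not answers:
            continue  # nothing answered, so nothing to judge here
        wrong_share = answers.count(False) / len(answers)
        if wrong_share >= threshold:
            flagged.append((participant, wrong_share))
    # Worst offenders first, so we review the most suspect sessions first.
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)
```

The function only produces a list of sessions to review. A participant who was honestly confused by a poor tree can score just as badly as one who clicked at random, which is why the click paths still need a look.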

Updating correct answers

One more way that we “clean up” the data is changing the correct answers for our tasks.

This happens later in our analysis, when we’re examining where participants went for each task. Sometimes we’ll find that they have “discovered” answers that we should have marked correct at the outset. For more on this, see “Where they went” later in this chapter.

When we find additional correct answers (and this is surprisingly common), we need to go back and mark those new answers as correct, then make sure that the tool recalculates the results accordingly.

When we find pages that seem like very reasonable places to go for a given task, we need to make sure those pages actually help users who are performing that task in real life. We can either:

Include content on that page that satisfies the given task, or
Include a prominent cross-link to a page that does satisfy the task.

Recalculating the results

If we do alter the results (either by excluding/deleting sessions, or by adjusting correct answers), we need to make sure the scores are recalculated accordingly.

Depending on the tool we use, this may be done automatically, or we may need to trigger it manually.

And we should remember to download another local copy of the revised results, for safekeeping.

For the power-company study we described above, when we recalculated the scores after adding some missed correct answers, our results changed substantially (although they were still lower than we would have liked):

	Existing tree	New tree 1	New tree 2
Uncorrected score	36%	40%	35%
Corrected score	46%	43%	47%
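
If we ever need to recalculate by hand from an exported file, the logic is simple enough to sketch. The code below is an illustration under our own assumptions, not a description of how any tool computes its scores: the score is taken to be the share of task attempts that ended on a correct answer, a skipped task (recorded as `None`) counts as unsuccessful, and all session IDs, task IDs, and destinations are made up.

```python
def success_rate(sessions, correct_answers, excluded=frozenset()):
    """Share of task attempts that ended on a correct answer.

    sessions maps session ID -> {task ID: chosen destination, or None if skipped}.
    correct_answers maps task ID -> set of destinations counted as correct.
    excluded holds the IDs of sessions we have soft-deleted.
    """
    correct = total = 0
    for session_id, answers in sessions.items():
        if session_id in excluded:
            continue  # excluded sessions do not count either way
        for task_id, destination in answers.items():
            total += 1
            if destination in correct_answers[task_id]:
                correct += 1
    return correct / total if total else 0.0


# Made-up data: three sessions, two tasks.
sessions = {
    "s1": {"t1": "Billing/Pay bill", "t2": "Help/Outages"},
    "s2": {"t1": "Account/Payments", "t2": "Help/Outages"},
    "s3": {"t1": "Home", "t2": None},  # a garbage session
}
answers = {"t1": {"Billing/Pay bill"}, "t2": {"Help/Outages"}}
print(success_rate(sessions, answers))  # 0.5

# Accept a missed correct answer and exclude the garbage session.
answers["t1"].add("Account/Payments")
print(success_rate(sessions, answers, excluded={"s3"}))  # 1.0
```

Keeping the exclusions and the extra correct answers as separate inputs means the raw session data is never modified, which fits the advice to leave the original data untouched.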

