Tuesday, November 08, 2005
how much can we rely on user testing alone?
I believe the usability community has become distracted by the question "how many test subjects are enough?" Champions of "discount usability" developed a mathematical formula that supposedly proves that adding more test subjects after 5 or 6 yields little new information. The formula works as long as you don't examine its underlying assumptions. Look at them, and you find it holds only in ideal situations: you are doing a "health check" on something you expect will mostly work, and your test subjects are a homogeneous group. If either of these conditions isn't true, all bets are off about how many users you need.
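To make the dependence on assumptions concrete, here is a minimal sketch. It assumes the formula in question is the widely quoted problem-discovery model, found(n) = 1 - (1 - p)^n, where p is the chance that one test subject exposes a given problem. The function names are my own, 0.31 is the average usually quoted alongside that model, and 0.10 is a made-up value standing in for harder-to-spot problems or a mixed audience.

```python
import math


def share_of_problems_found(n_users: int, p_detect: float) -> float:
    """Expected share of problems seen at least once after n_users sessions.

    Assumes every problem has the same chance p_detect of showing up in any
    one session, and that sessions are independent (the "ideal" case).
    """
    return 1.0 - (1.0 - p_detect) ** n_users


def users_needed(target_share: float, p_detect: float) -> int:
    """Smallest number of users whose expected coverage reaches target_share."""
    return math.ceil(math.log(1.0 - target_share) / math.log(1.0 - p_detect))


if __name__ == "__main__":
    # 0.31 is the commonly quoted average; 0.10 is a hypothetical harder case.
    for p in (0.31, 0.10):
        print(f"p={p:.2f}: 5 users find {share_of_problems_found(5, p):.0%}, "
              f"80% coverage needs {users_needed(0.80, p)} users")
```

Under these assumptions, five users are expected to surface roughly 84% of problems when p is 0.31, but only about 41% when p is 0.10, where reaching 80% takes 16 users. The "5 or 6" figure is a consequence of the chosen p, not a constant.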
User testing is an opportunity to test hypotheses about what users need, within the context of other design constraints. Running tests with users who offer no further enlightenment is an obvious annoyance, and such superfluous testing costs extra. Still, one has to acknowledge that one never knows in a preliminary test what will test poorly, and so one can't prejudge the scope and scale of the issues. Doing proper iterative testing, where every design requirement is subjected to multiple tests, will throw up issues everywhere. Because the scope of testing is fluid in iterative testing, one can't say in advance how a result will settle. You play with alternatives to get reactions for as long as the reactions remain diverse and the project has the time and budget to explore that diversity.
Another complication arises when a single design is meant to serve diverse users. One client I have worked with on many projects segments users by age category and by whether they are consumers or business customers. All these segments need to be covered in testing because the client's stakeholders organize their products and processes around them. But prior to testing, one is never sure how the segments might differ in their reactions to prototype designs. They might all react the same way, in which case checking every segment seems like overkill in retrospect. What often happens is that one or two subjects differ from the rest of the test subject population. Is this an expression of their segment's preferences, or is it noise? Because the groups have been subdivided so finely, it can be hard to tell.
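One rough way to see why finely divided segments are hard to read is to put a confidence interval around what a handful of subjects tell you. The sketch below uses the Wilson score interval for a proportion. That is my choice of illustration, not a method the client or I used on those projects, and the segment sizes are hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(hits: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    p_hat = hits / n
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    # Hypothetical: 1 of 2 subjects in a segment dislikes the prototype,
    # compared with 10 of 20 in a much larger (and costlier) sample.
    for hits, n in ((1, 2), (10, 20)):
        low, high = wilson_interval(hits, n)
        print(f"{hits} of {n}: plausible dislike rate {low:.0%} to {high:.0%}")
```

With one dissenter out of two subjects, the interval runs from roughly 9% to 91%, which is compatible with almost any story about the segment. With 10 of 20 it narrows to roughly 30% to 70%, still wide, but at least informative.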
User testing yields wonderful data, but it can be difficult to draw overarching conclusions from it. I therefore encourage clients to do pre-design user research, so that user preferences and needs are at least partly established before a design is actually tested. Such research also lets one learn whether a design is bombing because of a compromise forced by an external requirement: you guessed users wouldn't be keen on the compromise, and indeed they weren't.