Task-Centered User Interface Design
A Practical Introduction
by Clayton Lewis and John Rieman
Copyright ©1993, 1994: Please see the "shareware notice" at the front of the book.

5.1 Choosing Users to Test
5.2 Selecting Tasks for Testing
5.3 Providing a System for Test Users to Use
5.4 Deciding What Data to Collect
5.5 The Thinking Aloud Method
        5.5.1 Instructions
        5.5.2 The Role of the Observer
        5.5.3 Recording
        5.5.4 Summarizing the Data
        5.5.5 Using the Results
5.6 Measuring Bottom-Line Usability
        5.6.1 Analyzing the Bottom-Line Numbers
        5.6.2 Comparing Two Design Alternatives
5.7 Details of Setting Up a Usability Study
        5.7.1 Choosing the Order of Test Tasks
        5.7.2 Training Test Users
        5.7.3 The Pilot Study
        5.7.4 What If Someone Doesn't Complete a Task?
        5.7.5 Keeping Variability Down
        5.7.6 Debriefing Test Users

5.6.1 Analyzing the Bottom-Line Numbers

When you've got your numbers you'll run into some hard problems. The trouble is that the numbers you get from different test users will be different. How do you combine these numbers to get a reliable picture of what's happening?

Suppose users need to be able to perform some task with your system in 30 minutes or less. You run six test users and get the following times:

         20 min
         15 min
         40 min
         90 min
         10 min
          5 min

Are these results encouraging or not? If you take the average of these numbers you get 30 minutes, which looks fine. If you take the MEDIAN, that is, the middle score, you get something between 15 and 20 minutes, which looks even better. Can you be confident that the typical user will meet your 30-minute target?
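The two summary figures just quoted can be checked with a few lines of Python. This is only a sketch using the standard `statistics` module; the variable names are ours, not part of the original text.

```python
import statistics

# Task times in minutes for the six test users in the example.
times = [20, 15, 40, 90, 10, 5]

# Mean: the sum of the times divided by how many there are.
mean_time = statistics.mean(times)

# Median: with an even number of values, the midpoint of the two
# middle values once the list is sorted (here 15 and 20).
median_time = statistics.median(times)

print("mean:", mean_time, "median:", median_time)
```

The median lands halfway between the two middle scores, which is the "something between 15 and 20 minutes" mentioned above.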

The answer is no. The numbers you have are so variable, that is, they differ so much among themselves, that you really can't tell much about what will be "typical" times in the long run. Statistical analysis, which is the method for extracting facts from a background of variation, indicates that the "typical" times for this system might very well be anywhere from about 5 minutes to about 55 minutes. Note that this is a range for the "typical" value, not the range of possible times for individual users. That is, it is perfectly plausible given the test data that if we measured lots and lots of users the average time might be as low as 5 min, which would be wonderful, but it could also be as high as 55 minutes, which is terrible.

There are two things contributing to our uncertainty in interpreting these test results. One is the small number of test users. It's pretty intuitive that the more test users we measure the better an estimate we can make of typical times. Second, as already mentioned, these test results are very variable: there are some small numbers but also some big numbers in the group. If all six measurements had come in right at (say) 25 minutes, we could be pretty confident that our typical times would be right around there. As things are, we have to worry that if we look at more users we might get a lot more 90-minute times, or a lot more 5-minute times.

It's the job of statistical analysis to juggle these factors -- the number of people we test and how variable or consistent the results are -- and give us an estimate of what we can conclude from the data. This is a big topic, and we won't try to do more than give you some basic methods and a little intuition here.

Here's a cookbook procedure for getting an idea of the range of typical values that are consistent with your test data.

   1. Add up the numbers. Call this result "sum of x". In our example this is 180.
   2. Divide by n, the number of numbers. The quotient is the average, or mean, of the measurements. In our example this is 30.
   3. Add up the squares of the numbers. Call this result "sum of squares". In our example this is 10450.
   4. Square the sum of x and divide by n. Call this "foo". In our example this is 5400.
   5. Subtract foo from sum of squares and divide by n-1. In our example this is 1010.
   6. Take the square root. The result is the "standard deviation" of the sample. It is a measure of how variable the numbers are. In our example this is 31.78, or about 32.
   7. Divide the standard deviation by the square root of n. This is the "standard error of the mean" and is a measure of how much variation you can expect in the typical value. In our example this is 12.97, or about 13.
   8. It is plausible that the typical value is as small as the mean minus two times the standard error of the mean, or as large as the mean plus two times the standard error of the mean. In our example this range is from 30-(2*13) to 30+(2*13), or about 5 to 55. (The "*" stands for multiplication.) 
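The eight steps above can be sketched as a short Python function. The function name `plausible_range` and its return layout are our own choices for illustration; the arithmetic follows the cookbook step by step, including the intermediate quantity the text calls "foo".

```python
import math

def plausible_range(values):
    """Apply the cookbook steps; return (mean, std_dev, std_error, low, high)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two measurements")
    sum_x = sum(values)                        # step 1: sum of x
    mean = sum_x / n                           # step 2: the mean
    sum_squares = sum(x * x for x in values)   # step 3: sum of squares
    foo = sum_x ** 2 / n                       # step 4: "foo"
    # Step 5. The max() guards against a tiny negative result caused by
    # floating-point rounding when all the values are nearly identical.
    variance = max(0.0, (sum_squares - foo) / (n - 1))
    std_dev = math.sqrt(variance)              # step 6: standard deviation
    std_error = std_dev / math.sqrt(n)         # step 7: standard error of the mean
    low = mean - 2 * std_error                 # step 8: plausible range
    high = mean + 2 * std_error
    return mean, std_dev, std_error, low, high

times = [20, 15, 40, 90, 10, 5]  # minutes, from the example
mean, std_dev, std_error, low, high = plausible_range(times)
print(f"mean={mean:.0f} sd={std_dev:.2f} se={std_error:.2f} "
      f"range={low:.1f} to {high:.1f}")
```

Run on the example data without rounding the standard error to 13, the range comes out at roughly 4 to 56 minutes, which agrees with the "about 5 to 55" quoted in step 8.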

What does "plausible" mean here? It means that if the real typical value is outside this range, you were very unlucky in getting the sample that you did. More specifically, if the true typical value were outside this range you would only expect to get a sample like the one you got 5 percent of the time or less.

Experience shows that usability test data are quite variable, which means that you need a lot of it to get good estimates of typical values. If you pore over the above procedure enough you may see that if you run four times as many test users you can narrow your range of estimates by a factor of two: the breadth of the range of estimates depends on the square root of the number of test users. That means a lot of test users to get a narrow range, if your data are as variable as they often are.
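The square-root relationship can be seen numerically with a small sketch. It assumes, purely for illustration, that the standard deviation stays at about 32 minutes no matter how many users are tested; in a real study it would shift as data come in. The helper name `range_width` is ours.

```python
import math

def range_width(std_dev, n):
    """Width of the 'mean plus or minus two standard errors' range for n users."""
    return 4 * std_dev / math.sqrt(n)

std_dev = 32  # rounded standard deviation from the worked example (assumed fixed)

# Each step multiplies the number of test users by four.
for n in (6, 24, 96):
    print(n, "users: range about", round(range_width(std_dev, n), 1), "minutes wide")
```

Each fourfold increase in test users cuts the width in half, so getting from a range about 52 minutes wide down to about 13 minutes wide would take 96 users rather than 6 under this assumption.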

What this means is that you can anticipate trouble if you are trying to manage your project using these test data. Do the test results show we are on target, or do we need to pour on more resources? It's hard to say. One approach is to get people to agree to try to manage on the basis of the numbers in the sample themselves, without trying to use statistics to figure out how uncertain they are. This is a kind of blind compromise: on the average the typical value is equally likely to be bigger than the mean of your sample, or smaller. But if the stakes are high, and you really need to know where you stand, you'll need to do a lot of testing. You'll also want to do an analysis that takes into account what it costs you to be wrong about the typical value, and by how much, so you can decide how big a test is really reasonable.

HyperTopic: Measuring User Preference

Another bottom-line measure you may want is how much users like or dislike your system. You can certainly ask test users to give you a rating after they finish work, say by asking them to indicate how much they liked the system on a scale of 1 to 10, or by choosing among the statements, "This is one of the best user interfaces I've worked with", "This interface is better than average", "This interface is of average quality", "This interface is poorer than average", or "This is one of the worst interfaces I have worked with." But you can't be very sure what your data will mean. It is very hard for people to give a detached measure of what they think about your interface in the context of a test. The novelty of the interface, a desire not to hurt your feelings (or the opposite), or the fact that they haven't used your interface for their own work can all distort the ratings they give you. A further complication is that different people can arrive at the same rating for very different reasons: one person really focuses on response time, say, while another is concerned about the clarity of the prompts. Even with all this uncertainty, though, it's fair to say that if lots of test users give you a low rating you are in trouble. If lots give you a high rating that's fine as far as it goes, but you can't rest easy.
