October 11, 2022 · 7 min read · 1,335 words

How To Run A Heuristic Evaluation Without Ceremony

The quiet test that ten engineers can run in an hour

The expert evaluation that does not happen is worth zero. The expert evaluation that happens in an afternoon is worth everything that comes from it.

A product team I joined had not run a heuristic evaluation in three years. The reason was not that they thought the method was obsolete. It was that the last time they had run one, the engagement had taken six weeks and produced a forty-page report, and the report had arrived three months after the relevant features had shipped. The team had concluded the method was too expensive to run again, and had quietly stopped running it. Most teams I have worked with since have arrived at the same conclusion through some version of the same experience.

The honest read of the situation is that the canonical heuristic evaluation has been priced out of reach by the ceremony around it. The ceremony was never a property of the method. It was a property of the consulting industry that taught the method.

What the canonical version assumes that it does not need to assume

The textbook heuristic evaluation assumes a team of three to five trained evaluators, a multi-day inspection schedule, a structured reporting template with severity ratings on a five-point scale, and a stakeholder review meeting at the end. Each of those assumptions is defensible in isolation. Together they produce an engagement that costs more than most teams can spare, especially in product cycles measured in two-week sprints.

The training assumption is the first one to question. The ten classic principles fit on a single sheet of paper. Anyone on the team who has read the sheet and looked at the interface for an hour can identify clear violations. Subtle violations require more practice, but the clear ones, which are the ones that drive most of the value, do not.

The team-of-five assumption is the second to question. The original recommendation came from a study showing that more evaluators surface more issues, with diminishing returns past five. Five is a ceiling, not a floor. A single evaluator finds roughly a third of the issues five evaluators would find. Two evaluators find roughly half. The marginal cost of adding evaluators is real, the marginal value is real, and the choice of how many to include should match the team’s appetite, not the textbook.
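
The diminishing-returns shape can be sketched with the independent-detection model usually attributed to Nielsen and Landauer, in which each evaluator is assumed to find a fixed proportion of the issues independently of the others. The detection rate of 0.25 below is an assumed, illustrative value and not a figure from any study cited here; the function names are hypothetical. Treat the output as the shape of the curve, not a prediction for a particular product.

```python
def share_found(evaluators: int, detection_rate: float = 0.25) -> float:
    """Expected share of all issues found by a group of independent evaluators.

    detection_rate is the ASSUMED proportion of issues one evaluator finds.
    """
    if evaluators < 0:
        raise ValueError("evaluators must be non-negative")
    if not 0.0 < detection_rate <= 1.0:
        raise ValueError("detection_rate must be in (0, 1]")
    # An issue is missed only if every evaluator misses it.
    return 1.0 - (1.0 - detection_rate) ** evaluators


def relative_to_panel(evaluators: int, panel_size: int = 5,
                      detection_rate: float = 0.25) -> float:
    """Share of a full panel's findings that a smaller group would surface."""
    return (share_found(evaluators, detection_rate)
            / share_found(panel_size, detection_rate))


# Print the curve for one to seven evaluators under the assumed rate.
for n in range(1, 8):
    print(f"{n} evaluator(s): {share_found(n):.0%} of all issues, "
          f"{relative_to_panel(n):.0%} of what five would find")
```

Under the assumed rate, one evaluator comes out at roughly a third of what five would find and two at a little over half, and each added evaluator contributes less than the one before. A different rate moves the numbers but not the shape.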

The structured-report assumption is the third. Severity ratings are useful when the audience for the report needs to triage hundreds of findings. When the audience is the team itself, fixing the issues over the next sprint, the severity rating is replaced by a conversation in standup. The report becomes a list of findings with a one-sentence note each. The format matches the team’s actual workflow, and the report becomes something the team will actually read.

What the informal version actually looks like

The informal heuristic evaluation that has worked for the teams I have watched run it has roughly this shape. A designer prints the ten principles on one sheet, gathers two or three teammates from any role, picks one user-facing flow to inspect, and books an hour. Each evaluator walks the flow on their own laptop, marking down anything that violates one of the principles. The group then reconvenes for thirty minutes to compare notes and dedupe.

The output is a list. Not a report. The list has one row per finding, with three fields. The principle violated. The location in the product. The fix the team thinks is right. The list goes into the same backlog the team uses for everything else, with the items tagged so they can be triaged together. Total time investment, including the meeting, is usually under three hours of designer time and an hour each from two teammates.
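
The three-field list and the dedupe step can be sketched in a few lines. This is a minimal illustration, not a prescribed tool: the `Finding` and `merge_findings` names, the `heuristic-eval` tag, and the sample findings are all hypothetical, and the choice to treat two notes as duplicates when they share a principle and a location is an assumption a team may want to change.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Finding:
    principle: str  # the principle violated
    location: str   # where in the product it was seen
    fix: str        # the fix the team thinks is right


def merge_findings(*notes_per_evaluator: Iterable[Finding]) -> List[Finding]:
    """Combine every evaluator's notes, keeping the first finding recorded
    for each (principle, location) pair. Matching ignores case and padding."""
    merged = {}
    for notes in notes_per_evaluator:
        for finding in notes:
            key = (finding.principle.strip().lower(),
                   finding.location.strip().lower())
            merged.setdefault(key, finding)
    return list(merged.values())


# Hypothetical notes from two evaluators walking the same flow.
designer_notes = [
    Finding("Visibility of system status", "Checkout / payment step",
            "Show a spinner while the card is being charged"),
    Finding("Error prevention", "Checkout / address form",
            "Validate the postcode before submit"),
]
engineer_notes = [
    Finding("visibility of system status", "checkout / payment step",
            "Disable the Pay button after the first click"),
]

backlog = merge_findings(designer_notes, engineer_notes)
for item in backlog:
    # One row per finding, tagged so the items can be triaged together.
    print(f"[heuristic-eval] {item.principle} | {item.location} | {item.fix}")
```

Keeping the first note and dropping later duplicates is the simplest rule; in the thirty-minute compare session the group would normally pick the better fix by discussion.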

The output is also notably consistent across runs. A team that runs an informal evaluation every quarter will find that the recurring violations are usually the same three or four. The team learns where the product’s structural usability weaknesses are, and the recurring findings stop being surprising. They become a quiet maintenance backlog the team works on between bigger initiatives. The evaluation transitions from an event to a check-in.

This is roughly the opposite of how the formal evaluation is typically described. The formal version is sold as a heavyweight intervention. The informal version that actually runs is a light habit.

Why the team that runs it informally beats the team that books a consultant

The team that hires a consultant to run a formal evaluation gets a thorough report, a thoughtful severity-ranked list, and a presentation that puts the findings in front of leadership. The team also gets the report three months after they hoped to have it, because consultancies are booked out, and the report covers a version of the product that has shipped two updates since the inspection started. The findings are real, and many of them are stale.

The team that runs an informal evaluation in-house gets a shorter list, with less rigour, the same week. The findings are fresh because the inspection happened against the live product. The fixes ship in the next sprint or two, and the team has the satisfaction of having moved the metric, which keeps the practice alive. The next quarter, they run another one.

Compounding favours the informal version. A team that runs an evaluation four times a year, even at a quarter of the rigour of the formal version, will usually find more issues over a year than a team that runs one rigorous evaluation a year, because each run inspects the product as it currently stands instead of a snapshot that ages between inspections. The marginal cost of the informal evaluation is low enough that the team can afford to run it on every major flow before launch, and it will catch issues that a more careful but less frequent inspection would miss.

The wrinkle about the ten principles

A real concern about the informal version is that the ten classic principles, formulated in 1994, are no longer a complete description of what makes an interface usable. Several of them have aged well. Visibility of system status, error prevention, and consistency are timeless. Others have aged less gracefully, particularly in the context of product categories the original principles did not anticipate.

A team running the informal evaluation today should treat the ten principles as a starting point, not a complete framework. Adding a small set of supplementary heuristics for the team’s specific product context usually helps. A team building a data analytics product might add a principle about interpretability of charts. A team building a developer tool might add a principle about latency of feedback. A team building a consumer subscription product might add a principle about pricing transparency.
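
One way to keep the supplementary heuristics next to the classic ten is to build the inspection sheet from both. The sketch below is an assumption about how a team might store the list, not an established format; `build_checklist` is a hypothetical helper, and the supplementary entry is the developer-tool example from the paragraph above.

```python
from typing import Iterable, List

# The ten classic principles, by their commonly used names.
CLASSIC_PRINCIPLES = (
    "Visibility of system status",
    "Match between system and the real world",
    "User control and freedom",
    "Consistency and standards",
    "Error prevention",
    "Recognition rather than recall",
    "Flexibility and efficiency of use",
    "Aesthetic and minimalist design",
    "Help users recognize, diagnose, and recover from errors",
    "Help and documentation",
)


def build_checklist(supplementary: Iterable[str] = ()) -> List[str]:
    """Return the classic principles followed by the team's own additions,
    skipping blanks and anything already on the sheet (case-insensitive)."""
    checklist = list(CLASSIC_PRINCIPLES)
    seen = {name.lower() for name in checklist}
    for principle in supplementary:
        name = principle.strip()
        if name and name.lower() not in seen:
            checklist.append(name)
            seen.add(name.lower())
    return checklist


# Hypothetical sheet for a team building a developer tool.
developer_tool_sheet = build_checklist(["Latency of feedback", "Error prevention"])
for number, name in enumerate(developer_tool_sheet, start=1):
    print(f"{number}. {name}")
```

Keeping the additions in one place means the same sheet is used on every run, which makes quarter-to-quarter findings comparable.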

The supplementary principles are themselves a useful artefact, because writing them down forces the team to articulate what specifically counts as good usability in their domain. A team that has done this work will run heuristic evaluations that catch issues a textbook evaluator would miss. The textbook evaluator is generic by necessity. The team is specific by privilege.

What stops most teams from running it

What stops most teams from running the informal evaluation is, almost always, the assumption that it is not worth running unless it is run formally. The team has read about the textbook version, has noted the resource requirements, and has filed the method under “things we will do when we have more time.” That time never arrives, the method never gets run, and the team ships products with usability issues that an afternoon would have caught.

A useful tactic for the designer who wants to introduce the practice is to run the first one alone, without asking permission, and then bring the findings to the next sprint planning. The team sees concrete issues with concrete fixes. The next quarter, the designer asks for an hour from two teammates. The quarter after, the practice is ambient.

The path to a team that runs heuristic evaluations is paved by the team running heuristic evaluations.

Terms / explained

Described terms.

Heuristic evaluation: A usability inspection method in which evaluators walk through an interface against a small set of established usability principles, recording violations they observe.
Usability heuristics: A short list of design principles, most famously the ten formalised in 1994, that describe properties of usable interfaces and serve as the inspection criteria during heuristic evaluation.
Severity rating: A numeric or descriptive scale applied to each finding in a heuristic evaluation, indicating how serious the violation is and how much priority the fix deserves, typically ranging from cosmetic to catastrophic.
Cognitive walkthrough: A related inspection method that focuses on how a first-time user would learn an interface step by step, distinct from heuristic evaluation in that it traces a specific task rather than scanning the interface against principles.

FAQ / questions

Frequently asked.

What is a heuristic evaluation?

A method of inspecting an interface against a small set of established usability principles, performed by one or more evaluators who walk through the product looking for violations. The classic version uses ten principles formalised in 1994, including visibility of system status, match between system and real world, user control, consistency, error prevention, recognition over recall, flexibility, minimalist design, error recovery, and help documentation.

Why do most teams skip heuristic evaluation?

Because the canonical version of the method is presented as a multi-week research engagement requiring three to five trained evaluators, written reports, and severity ratings. Most product teams do not have that capacity, do not see how to compress it, and end up never running the evaluation at all. The informal version, run by anyone on the team for an afternoon, costs almost nothing and surfaces most of what the formal version would surface.

Who should run a heuristic evaluation?

Anyone with a working understanding of the ten principles can run a useful evaluation, including engineers, product managers, support staff, and customer-facing roles. Trained UX researchers will catch more subtle violations and produce more rigorous severity ratings, but the difference is a matter of degree. In my experience, the team that runs an informal evaluation with a designer and two engineers gets most of the value of the formal version at a small fraction of the cost.
