
Patterns from a month of user testing: what people try first, what surprises them, and what we got wrong.
We ran sixty user sessions over the last month. Sessions lasted between five minutes and two hours. We watched. We did not help. We took notes.
Almost everyone started with a question about the weather. We are not sure why (the product has nothing to do with weather), but it was the near-universal first prompt. The second prompt was usually a calendar question. The third was usually a longer, more ambitious request that the agent had to push back on.
The thing that surprised people most was not what the agent could do. It was how quickly it failed. When the agent did not understand a request, it failed instantly and clearly, and people moved on without frustration. Slow failure is worse than fast failure.
We assumed people would speak in short, command-like sentences. They did not. People speak in long, rambling, half-corrected thoughts, and any product that requires the user to clean up their speech is going to feel unnatural. We rewrote most of our parsing to handle the way people actually talk, not the way we wished they would.