Why Context and Ambiguity Matter in Conversational AI
Between the Lines
At a mid‑sized tech company, an AI‑powered “Team Pulse” bot was deployed to monitor morale in internal Slack channels. Within two weeks it tagged 17% of engineering messages as highly positive, despite widespread frustration over shifting priorities. One trigger was a sarcastic remark: “Wow — our third production rollback this month. Really shows how stable our CI/CD pipeline is.” The bot logged the comment as praise and auto‑recommended the author for a “Reliability Champion” badge. Two days later the engineer publicly declined the badge, explaining that the message was born of despair, not pride. The team paused the bot’s sentiment analysis module to retrain it with annotated messages.
Why did the bot get it so wrong? Language models excel at pattern matching, but sarcasm relies on tone, timing and shared context: cues that are largely invisible in text. Current bots also lack what cognitive scientists call theory of mind, the ability to infer others’ beliefs and intentions. They take token sequences at face value and ignore the subtle social signals humans use.
This newsletter digs into those blind spots. We will explore why context and ambiguity trip up today’s AI systems, how memory and prompt design can help or harm, and what psychological and ethical issues arise when humans talk to machines.
Reading Between the Lines
That misawarded reliability badge illustrates a larger problem: context starvation. When models are starved of relevant context they produce answers that may be syntactically correct but semantically wrong. Forbes recently described context as “everything the spreadsheet leaves out — goals, guardrails, jargon, user emotions, compliance rules and timing”. The same article noted that over 80% of AI projects stall or fail because the environment they are deployed in no longer matches the context they were trained on. Put differently, it is not that models are dumb; it is that they often misinterpret or omit the nuances that matter in real life.
Early conversational systems showed that people were willing to attribute understanding to simple pattern‑matching programs. Users confided in scripted responses because the machine reflected their statements back at them, not because it truly understood. As digital assistants matured, memory and context windows gradually expanded. Apple’s Siri popularized voice assistants in 2011 but could only handle single‑turn queries. Research and model improvements led to large context windows in the 2020s; Gemini 2.5 shipped with a 1 million token context window in 2025 and developers teased a 2 million token window soon after. Yet bigger windows do not guarantee understanding.
The Lost in the Middle study showed that models often forget information buried in the middle of long contexts, and the Context Rot report found that as input length increases, performance becomes unreliable. In fact, information placed toward the center of a context window leads to a 15–20 percentage point drop in accuracy compared with the same information at the beginning or end. These findings remind us that context is about focus, not just capacity.
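One practical response to the lost-in-the-middle effect, assuming you can score each retrieved chunk’s relevance yourself, is to order chunks so the strongest ones sit at the edges of the prompt rather than buried in the center. A minimal sketch (the function name and scoring inputs are illustrative, not from any particular library):

```python
def edge_order(chunks_with_scores):
    """Reorder (chunk, score) pairs so the most relevant chunks land at
    the start and end of the prompt, leaving weaker ones in the middle,
    where models attend least reliably."""
    ranked = sorted(chunks_with_scores, key=lambda cs: cs[1], reverse=True)
    front, back = [], []
    for i, (chunk, _) in enumerate(ranked):
        # Alternate placement: best chunk first, second-best last,
        # working inward so the weakest chunks end up mid-prompt.
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

prompt_chunks = edge_order([("a", 0.2), ("b", 0.9), ("c", 0.5), ("d", 0.7)])
```

The ordering trick does not add capacity; it just spends the model’s uneven attention where it counts.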
Context Engineering: Solving the Right Problem
A 2025 article on context engineering argues that most AI agent failures stem from context mismanagement, not inferior models.
When designers dump entire documentation libraries, hundreds of conversation turns and every possible tool definition into the context window, they overwhelm the model. This practice, known as context pollution, forces the model to sift through noise and leads to a phenomenon called context rot, where accuracy decreases as context length grows. The fix is just‑in‑time retrieval: provide only the information needed for the current task and let the model fetch additional details on demand.
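Just-in-time retrieval can be sketched in a few lines. This is a toy version, using crude keyword overlap where a real system would use embeddings; the document library and thresholds are invented for illustration:

```python
def score(query, doc):
    """Crude lexical-overlap relevance score: fraction of query words
    that also appear in the document."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / (len(q) or 1)

def just_in_time_context(query, library, k=2, min_score=0.2):
    """Pull only the top-k documents that clear a relevance bar,
    instead of pasting the whole library into the prompt."""
    ranked = sorted(library, key=lambda doc: score(query, doc), reverse=True)
    return [doc for doc in ranked[:k] if score(query, doc) >= min_score]

library = [
    "how to reset your password in the admin console",
    "quarterly revenue report for the sales team",
    "password rotation policy for service accounts",
]
ctx = just_in_time_context("reset password", library)
```

The discipline, not the scorer, is the point: the model sees two relevant snippets instead of the whole wiki.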
Tool management also matters. Enterprise deployments often expose dozens of tools to a single agent, creating ambiguous decision points; performance degrades when agents must choose among more than 5–10 tools. The recommended approach is to divide functionality among specialized sub‑agents and use a routing agent to delegate tasks.
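The routing pattern can be sketched like this. Here a keyword table stands in for an LLM-based classifier, and the sub-agents, tools and routes are all hypothetical names, not a real framework’s API:

```python
# Each sub-agent owns a small, unambiguous tool set; the router only
# decides which specialist handles the request.
SUB_AGENTS = {
    "billing": {"tools": ["lookup_invoice", "issue_refund"]},
    "devops": {"tools": ["restart_service", "tail_logs"]},
    "hr": {"tools": ["check_leave_balance"]},
}

ROUTES = {  # keyword routing stands in for an LLM classifier here
    "invoice": "billing", "refund": "billing",
    "deploy": "devops", "logs": "devops", "restart": "devops",
    "vacation": "hr", "leave": "hr",
}

def route(request):
    """Return the sub-agent (and its tool set) that should handle the
    request, falling back to a clarifying step when no route matches."""
    for word in request.lower().split():
        if word in ROUTES:
            name = ROUTES[word]
            return name, SUB_AGENTS[name]["tools"]
    return None, ["ask_user_to_clarify"]

agent, tools = route("please restart the payments service")
```

Because each specialist sees at most two or three tools, the ambiguous many-tool decision point never arises inside a single agent.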
Long interactions introduce another dimension: memory management. Naively keeping every past message until the context limit is reached leads to either truncated histories or bloated prompts. Better strategies include summarizing old turns while preserving key decisions, storing notes outside the context window and clearing raw tool outputs once they’ve served their purpose. The same article also warns against vague system prompts that assume shared understanding and insufficient prompt detail that leaves the model guessing. Clear instructions, examples, tone guidelines and escalation paths help the model operate at the right “altitude.”
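Those memory strategies can be made concrete. In this sketch the “summary” is a naive truncation of the joined old turns; a real system would ask the model to write the summary, but the shape of the history compaction is the same (all names here are illustrative):

```python
def compact_history(turns, keep_recent=4, max_summary_chars=200):
    """Summarize older turns into one note while keeping the most
    recent turns verbatim. Each turn is a dict like
    {"role": "user", "text": "..."}."""
    if len(turns) <= keep_recent:
        return turns
    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    # Stand-in summary: join old turns and cap the length.
    summary = " | ".join(t["text"] for t in old)[:max_summary_chars]
    return [{"role": "summary", "text": summary}] + recent

def drop_stale_tool_output(turns):
    """Clear raw tool outputs once they've served their purpose,
    keeping only user/assistant turns in the prompt."""
    return [t for t in turns if t["role"] != "tool"]
```

Run both before every model call and the prompt stays bounded no matter how long the conversation runs.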
Finally, context should be treated as a finite budget. Every token you add to a prompt has a cost: it consumes the model’s attention and may distract it from what matters. Asking whether each piece of information increases the signal or just noise helps designers maintain focus.
Ambiguity: Not All Vagueness Is Equal
Ambiguity arises when a query allows multiple valid interpretations. According to a January 2026 guide on ambiguous queries, many product teams first notice ambiguity through user complaints: irrelevant recommendations, misaligned analytics and support answers that miss the point. The article notes that 23% of ambiguous questions in the AmbigQA dataset involve entity‑reference ambiguities, but the majority stem from timing, answer type or missing constraints. Examples include “Show me affordable laptops for work,” where “affordable” and “work” mean different things to different users, and support queries like “It stopped working again,” which assume a shared understanding of which device or feature broke.
The same guide provides a taxonomy of ambiguity: lexical ambiguity occurs when a word has multiple meanings (“How do I charge Apple?” could mean billing the company, charging a phone or taking legal action); referential ambiguity arises when pronouns or noun phrases lack a clear referent; task and constraint ambiguity involve missing parameters; and persona ambiguity reflects differences in what “we” means across departments. Designing for clarity requires surfacing these missing details, either through the user interface (drop‑down menus, follow‑up questions) or through prompting patterns that ask the model to seek clarification before acting.
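One way to operationalize “surfacing missing details” is a slot check: define which constraints a task needs, and generate a targeted follow-up when the user leaves them out. The task schema and wording below are hypothetical, sketched for the laptop example above:

```python
REQUIRED_SLOTS = {  # hypothetical task schema: constraints a query must pin down
    "laptop_search": ["budget", "use_case"],
}

def missing_slots(task, provided):
    """Return the constraints the user has not specified, so the UI or
    prompt can ask a targeted follow-up instead of guessing."""
    return [s for s in REQUIRED_SLOTS.get(task, []) if s not in provided]

def followup_question(task, provided):
    """Build one clarifying question covering all missing constraints,
    or return None when the query is fully specified."""
    gaps = missing_slots(task, provided)
    if not gaps:
        return None
    return f"Before I search, could you specify your {' and '.join(gaps)}?"

q = followup_question("laptop_search", {"use_case": "video editing"})
```

The same check works as a UI pattern (drop-downs for the missing slots) or as a prompting pattern (instructing the model to ask before acting).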
The Challenge of Resolving Ambiguous References
Pronouns like it, they or this are deceptively simple. Humans effortlessly resolve them by combining local context with commonsense knowledge and shared history. For AI systems, pronoun resolution can be surprisingly hard. The 2025 study “It Depends: Resolving Referential Ambiguity in Minimal Contexts” tests several large language models (DeepSeek v3, GPT‑4o, Qwen3‑32B, etc.) on multi‑turn conversations that require commonsense reasoning. Researchers created scenarios where the pronoun it could refer to multiple entities, such as a helicopter and a drum; humans know drums can’t fly, but models sometimes commit to the wrong referent or hedge by listing all possibilities.
The study found that current LLMs struggle to resolve ambiguity effectively, often committing to a single interpretation or enumerating all plausible references instead of asking for clarification. Simplifying the prompt language makes the problem worse, reducing the models’ use of commonsense reasoning. The authors highlight the importance of common ground—the mutual knowledge, beliefs and assumptions built over a conversation. When this common ground is missing or when models lack the ability to infer it, misinterpretation follows. Misinterpreted pronouns can cascade into larger failures such as misinformation, hallucinations or user confusion.
For designers, these findings suggest adding follow‑up questions when referential ambiguity is detected, using dialogue patterns that encourage the model to ask “Which item do you mean?” instead of guessing. They also reinforce the value of structuring data so that important entities have clear labels and contexts.
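The ask-instead-of-guess policy can be sketched directly. Here a tiny hand-written “affords” set stands in for commonsense knowledge (drums can’t fly), and all names are illustrative rather than taken from the study:

```python
def candidate_referents(pronoun_turn, entities):
    """List entities the pronoun could refer to, filtered by a toy
    commonsense predicate: the entity must afford the turn's verb."""
    verb = pronoun_turn["verb"]
    return [e for e in entities if verb in e["affords"]]

def resolve_or_ask(pronoun_turn, entities):
    """Commit only when exactly one referent survives; otherwise ask
    a clarifying question instead of guessing."""
    cands = candidate_referents(pronoun_turn, entities)
    if len(cands) == 1:
        return cands[0]["name"]
    if not cands:
        return "Which item do you mean?"
    return "Do you mean the " + " or the ".join(e["name"] for e in cands) + "?"

entities = [
    {"name": "helicopter", "affords": {"fly", "land"}},
    {"name": "drum", "affords": {"play"}},
]
answer = resolve_or_ask({"pronoun": "it", "verb": "fly"}, entities)
```

Real systems would get the candidate filter from the model itself; the design point is the commit-or-ask branch, which replaces silent guessing with an explicit clarification turn.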
Psychology Lens: Why We Project Meaning onto Machines
Language is inherently social. Humans unconsciously apply social heuristics when interacting with computers, a phenomenon known as the media equation. Research in psychology has shown that people are polite to computers, exhibit gender stereotyping toward machine voices and reciprocate favors with software. When a chatbot says “please” and “thank you,” users feel more engaged. Conversely, when a bot delivers curt, contextless answers, users feel offended. Designing for human expectations therefore requires more than technical accuracy; it requires social awareness.
The mismatch between human expectations and AI capabilities has practical consequences. When a model confidently answers a question, users assume it is correct even if it hallucinated the response. Conversely, when a model prefaces its answer with hedging language (“I’m not sure, but...”), users perceive it as less competent. Striking the right balance between confidence and honesty is a delicate design choice. Encouraging models to admit uncertainty and ask for clarification can build trust and reflect a more human conversational norm, even if it occasionally frustrates users.
Challenge and Reflection: Designing for Ambiguity and Accountability
Context and ambiguity are not bugs; they are features of human communication. The challenge for AI designers is to build systems that recognize their own limitations and seek clarification rather than guessing. Should an AI assistant ever guess when faced with multiple plausible interpretations, or should it always ask a follow‑up question? How do we ensure that memory features respect user privacy and remain transparent? Who is accountable when a chatbot’s recommendation goes wrong?
As Gal Steinberg points out, context is not just about getting the right answer — it’s about knowing who owns the consequences.
Thanks for reading!
Talk soon,
— Hasti

