Why Talkative AI Agents Can Be a Disaster in Finance

In the world of LLMs, we often hear that “Multi-step Reasoning” or “Multi-agent Architectures” can significantly improve performance. For scenarios involving creativity and brainstorming, letting agents “elaborate more” does expand the semantic boundary and spark unexpected insights.

However, in “quantitative finance validation” scenarios, where extreme precision is required, over-elaboration can actually hurt performance.

I recently read “A Multi-Agent Framework for Quantitative Finance,” a paper published by JPMorgan at EMNLP 2025. It proposes a framework of “Insights Agents,” such as a Data Summarizer, Finance Expert, and Query Refiner, that attempts to improve a Base Agent by adding financial knowledge and data preprocessing.

While this research explores the possibilities of complex architectures from an academic angle, from a practical perspective I have to be blunt: this is definitely not a production system actually running internally at JPMorgan.

The reason is simple: The accuracy (Pass@1) is too low.

1. 46% Accuracy: This is Just a Baseline

According to the paper’s data, even with the addition of so many Insights Agents, the overall Pass@1 accuracy is only 46% (compared to 39.59% for a single agent). When dealing with “Hard” level financial problems, the accuracy is even more abysmal. For quantitative finance, a field where “a single sign error can lead to a huge deviation,” such performance is far from acceptable.
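For readers less familiar with the metric, Pass@k is the probability that at least one of k sampled answers is correct. A minimal sketch of the commonly used unbiased estimator (popularized by the HumanEval code-generation benchmark) is below. Whether the paper computes its numbers with exactly this estimator is an assumption on my part, and the function name `pass_at_k` is my own.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k from n sampled answers, c of which are correct.

    Uses 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random subset of k samples contains no correct answer.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every subset of k has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical illustration: 10 samples for one problem, 5 of them correct.
print(pass_at_k(n=10, c=5, k=1))
print(pass_at_k(n=10, c=5, k=5))
```

With k = 1 the estimator reduces to c / n, so a Pass@1 of 46% means that a single sampled answer is wrong more often than it is right.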

2. The “Reflection Trap”

The paper mentions an interesting phenomenon: in some simple (Easy) tasks, the Pass@5 or Pass@10 metrics after adding a reflection mechanism were actually lower than without it. This supports the view that an Agent without external, deterministic verification (e.g., rigid symbolic checks or unit tests) is merely “confidently revising” its answers with no way to tell a fix from a regression, and it often makes things worse.
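To make the idea of external verification concrete, here is a minimal sketch of gating a reflection step behind deterministic checks. It is not the paper's method; `accept_revision`, the check list, and the discount-factor example are all hypothetical names and values I chose for illustration.

```python
from typing import Callable, Sequence

# A check is any deterministic predicate over a candidate numeric answer.
Check = Callable[[float], bool]


def accept_revision(original: float, revised: float, checks: Sequence[Check]) -> float:
    """Keep a reflected answer only if it passes every deterministic check.

    Falls back to the original answer if the revision fails, and raises if
    neither candidate passes, so a bad value is never silently returned.
    """
    def passes(value: float) -> bool:
        return all(check(value) for check in checks)

    if passes(revised):
        return revised
    if passes(original):
        return original
    raise ValueError("neither answer passes the deterministic checks")


# Hypothetical invariant: a discount factor at a non-negative rate lies in (0, 1].
discount_factor_checks = [lambda df: 0.0 < df <= 1.0]

original_answer = 1 / 1.05  # one-year discount factor at 5%
revised_answer = 1.05       # a "reflection" that inverted the formula

kept = accept_revision(original_answer, revised_answer, discount_factor_checks)
print(kept == original_answer)
```

The point of the design is that the reflection step never gets the final word: a revision only replaces the original answer when a rule outside the model agrees with it.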

3. The Danger of Lacking Ground Truth

In real business scenarios, if we don’t have ground truth, this “multi-agent, verbose” architecture can be extremely misleading. It will spew out a lot of seemingly professional financial jargon, generated code, and detailed reflection logs, creating an illusion for the user that “it understands a lot.” But when you delve into its computational logic, you might find that it has even misinterpreted field definitions.
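One hedged way to guard against misread field definitions is to attach the unit to the field in code instead of leaving it to the model's interpretation. The sketch below is illustrative only: `FieldSpec`, `to_decimal`, and the `credit_spread` field are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    """Explicit definition of a numeric field, including its unit."""
    name: str
    unit: str  # one of "decimal", "percent", "bps"


# Divisor that converts each declared unit into a plain decimal rate.
_UNIT_SCALE = {"decimal": 1.0, "percent": 100.0, "bps": 10_000.0}


def to_decimal(value: float, spec: FieldSpec) -> float:
    """Convert a raw field value to a decimal rate using its declared unit."""
    if spec.unit not in _UNIT_SCALE:
        raise ValueError(f"unknown unit {spec.unit!r} for field {spec.name!r}")
    return value / _UNIT_SCALE[spec.unit]


# Hypothetical field: a credit spread stored in basis points.
spread = FieldSpec(name="credit_spread", unit="bps")
correct = to_decimal(25.0, spread)

# The same raw value misread as a percentage is 100 times too large.
misread = to_decimal(25.0, FieldSpec(name="credit_spread", unit="percent"))
print(correct, misread)
```

A fluent explanation can hide this kind of error completely, while a declared unit makes it checkable.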

Core Reflection: Financial validation requires Symbolic Precision and rigorous logical verification, not more flamboyant semantic padding. We cannot rely on LLMs’ “linguistic talent” to solve problems of “logical computation.”

This type of paper serves as a good baseline for understanding the limitations of agent architectures, but on the path to AGI, we need a more rigorous validation framework, not just more “talkative” agents.

#LLM #AgenticAI #QuantitativeFinance #JPMorgan #EMNLP #LLMOps #AIVerification #VerifiQuant
