Every serious investor is running the same experiment right now.
Someone at the desk opens ChatGPT, Gemini, Claude, or Perplexity and types a query that sounds trivial:
“Which NBFCs mentioned write-offs in their Q2 FY26 disclosures?”
The answer arrives instantly.
Clean bullets. Confident phrasing. Perfect fluency.
For a moment, it feels like the morning meeting just automated itself.
Then you check the sources.
The answer is structurally unreliable.
This is the point when teams realize the core mismatch:
General LLMs optimise for answering. Financial research requires verifying.
And forcing these models into a finance workflow exposes every architectural fault line underneath.
The natural reaction after a few bad answers is:
“Fine, I’ll upload the Annual Report myself. That will force the model to read the right data.”
And it works for a single company, a single document, and a single question.
But real research isn’t a one-document workflow.
The moment you try to replicate what analysts actually do, three immovable walls appear.
A serious coverage universe isn’t one PDF.
It’s annual reports, credit ratings, concall transcripts, presentations, and announcements, multiplied across every company you cover.
Upload everything and ask, and general models start blurring.
You’re asking a probabilistic text generator to maintain multi-year continuity across thousands of messy tokens.
It simply wasn’t designed for that.
Even when an answer looks right, verification fails.
You click the citation and land on a file-level link, not the line that supports the number.
If every AI-generated answer still requires manual verification, nothing has been automated — the load has just been reshaped.
Your actual workflow isn’t about one company — it’s a sector.
To answer the original NBFC question reliably, you need:
20 companies × 10–20 filings each = hundreds of PDFs, thousands of pages.
General LLMs break here for three reasons:
You can upload a handful of documents per prompt or a limited set per workspace — never your full universe.
Even long-context models cannot maintain precise multi-year relationships across hundreds of documents.
General LLMs treat each PDF as an isolated island.
Financial research requires a timeline, not a file list.
At this scale, “chat with PDF” is a good reading tool but not a research system.
| Use Case | Companies | Documents (3-Year Window) | General LLM Behaviour | Actual Failure Mode |
|---|---|---|---|---|
| Single Stock | ~1 | 50–100 (AR ×3, credit ratings ×6, concalls ×12, presentations ×12, announcements ×60) | Partially Works | Local inconsistencies: standalone vs consolidated mixing, guidance/actual blending, missing middle pages, weak citations |
| Sector View | 30–50 | 3,000–5,000 | Not Operationally Possible | Structural failures: upload limits, attention decay, timeline confusion, no cross-company continuity |
| Market View | 500+ | 50,000+ | Not Operationally Possible | Architecture ceiling: cannot ingest full universes, cannot maintain multi-year links, cannot aggregate or compare at scale |
Even if hallucinations vanished and citations were perfect, chat still comes with constraints.
Analysts rarely work on one-off questions. They execute workflows: screen, compare, monitor, repeat.
Chat is linear. Workflows are cyclical, persistent, and stateful.
If you need 20 prompts to recreate a quarterly update, you haven’t automated the work — you’ve turned research into a command line.
| Layer | General LLM Behaviour | Finance Requirement | Mismatch |
|---|---|---|---|
| Training Objective | Predict next token (probabilistic completion) | Retrieve exact fact from filings | Fluency over accuracy |
| Memory / Context | Long, but lossy (compression + attention decay) | Multi-year continuity across filings | “Lost in the middle” problem |
| Citation | File-level links, unreliable grounding | Line-level, page-level evidence mapping | No positional metadata |
| OCR / Parsing | Generic OCR + linear text extraction | Layout-aware parsing for tables, scans, footnotes | Breaks on Indian PDFs & scanned filings |
| Indexing | Each file is isolated (File_ID) | Entity-level timeline, cross-document graph | No cross-document coherence |
| Workflow | Linear chat → one-off Q&A | Cyclical research loops (screen, compare, monitor) | Interface mismatch |
Most AI failures in finance have nothing to do with “bad prompts.”
They come from deep architectural constraints that make general LLMs fundamentally misaligned with statutory filings.
Below is a breakdown of the actual failure modes.
General LLMs are next-token predictors, not fact retrieval engines.
When a number is missing, they generate the statistically likely value, not the true one.
Consequence:
Hallucinations aren’t anomalies; they’re built-in behaviour of the architecture.
Requirement for finance:
A retrieval-first engine that is hard-coded to return “Not disclosed” when a fact is absent, instead of predicting a replacement.
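A minimal sketch of that retrieval-first contract. The fact store, field names, and the `NOT_DISCLOSED` sentinel are illustrative assumptions, not a real API; the point is only that a lookup either returns an anchored fact or an explicit absence, never a plausible guess.

```python
from dataclasses import dataclass

NOT_DISCLOSED = "Not disclosed"

@dataclass(frozen=True)
class Fact:
    value: str
    doc_id: str   # which filing the value came from
    page: int     # page-level anchor for verification

# Hypothetical pre-extracted fact store: (company, metric, period) -> Fact.
FACT_STORE = {
    ("NBFC_A", "write_offs", "Q2FY26"): Fact("₹120 cr", "NBFC_A_Q2FY26", 14),
}

def lookup(company: str, metric: str, period: str):
    """Retrieval-first: return the anchored fact, or an explicit
    'Not disclosed'. Never generate a statistically likely value."""
    fact = FACT_STORE.get((company, metric, period))
    if fact is None:
        return NOT_DISCLOSED
    return fact

print(lookup("NBFC_A", "write_offs", "Q2FY26"))  # anchored fact with page
print(lookup("NBFC_B", "write_offs", "Q2FY26"))  # "Not disclosed"
```

The hard-coded sentinel is the design choice: absence is a first-class return value, so downstream logic can treat it as data rather than a failure.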
LLMs are trained via RLHF to be helpful, not precise.
They avoid null answers because non-answers are penalised during training.
They cannot differentiate a genuine non-disclosure from a retrieval miss: regulatory silence and missing context look identical to them.
Consequence:
Fabricated positives or irrelevant policies appear where regulatory silence is the actual fact.
Requirement for finance:
A system that performs container-level scans (e.g., Notes on Borrowings) and can return deterministic absence-as-information.
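A sketch of what deterministic absence-as-information could look like, assuming filings have already been parsed into named containers. `FILINGS`, `scan_container`, and the three outcome labels are illustrative, but note that every company gets exactly one verifiable outcome, including "the container itself is missing".

```python
# Hypothetical parsed filings: company -> {container name -> container text}.
FILINGS = {
    "NBFC_A": {"Notes on Borrowings": "... write-offs of ₹120 cr ..."},
    "NBFC_B": {"Notes on Borrowings": "... no change in borrowing mix ..."},
    "NBFC_C": {},  # the container itself is absent from the filing
}

def scan_container(term: str, container: str) -> dict:
    """Deterministic container-level scan. Absence is reported as a
    fact, not papered over with a fabricated positive."""
    results = {}
    for company, containers in FILINGS.items():
        text = containers.get(container)
        if text is None:
            results[company] = "container absent"
        elif term in text:
            results[company] = "mentioned"
        else:
            results[company] = "not mentioned"
    return results

print(scan_container("write-offs", "Notes on Borrowings"))
```

Because the scan is exhaustive over a fixed universe, "not mentioned" is itself a defensible research finding, which is exactly what RLHF-tuned chat models are trained to avoid saying.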
Even 1M-token models exhibit non-linear attention decay.
LLMs overweight the beginning and end of long contexts and underweight the middle.
Dense financial tables are the worst-case input: high precision, low redundancy, zero tolerance for interpolation.
Consequence:
Upload 2,000 pages → the model confuses years, interpolates missing rows, or merges standalone & consolidated data.
Requirement for finance:
Extract → Normalize → Compare
Facts must be extracted and anchored before reasoning.
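The Extract → Normalize → Compare pipeline can be sketched as three small steps. The rows, units, and field names are hypothetical; the idea is that every value travels with its anchor (document, page) and reporting basis, so comparison can never silently mix standalone with consolidated figures, or crore with lakh.

```python
# Hypothetical extracted rows, as a layout-aware parser might emit them.
raw = [
    {"year": "FY25", "metric": "AUM", "value": "1,250", "unit": "cr",
     "basis": "consolidated", "doc": "AR_FY25", "page": 87},
    {"year": "FY24", "metric": "AUM", "value": "98,000", "unit": "lakh",
     "basis": "consolidated", "doc": "AR_FY24", "page": 91},
    {"year": "FY25", "metric": "AUM", "value": "1,100", "unit": "cr",
     "basis": "standalone", "doc": "AR_FY25", "page": 85},
]

def extract(row):
    """Step 1 (Extract): parse the number, keep the anchor."""
    return {**row, "value": float(row["value"].replace(",", ""))}

def normalize(row):
    """Step 2 (Normalize): convert everything to ₹ crore (1 cr = 100 lakh)."""
    if row["unit"] == "lakh":
        return {**row, "value": row["value"] / 100, "unit": "cr"}
    return row

def compare(rows, metric, basis):
    """Step 3 (Compare): same metric, same basis only, sorted by year."""
    series = sorted(
        (r for r in rows if r["metric"] == metric and r["basis"] == basis),
        key=lambda r: r["year"],
    )
    return [(r["year"], r["value"], f'{r["doc"]} p.{r["page"]}') for r in series]

facts = [normalize(extract(r)) for r in raw]
# FY24: 980.0 cr, FY25: 1250.0 cr, each carrying its page anchor.
print(compare(facts, "AUM", "consolidated"))
```

The standalone FY25 row is deliberately excluded by the `basis` filter: the comparison is deterministic, not left to the model's attention.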
Real filings contain dense multi-column tables, footnotes that modify headline numbers, scanned pages, and inconsistent headers.
Generic OCR + linear parsing collapses on a large share of Indian PDFs.
Consequence:
Downstream reasoning is corrupted before the LLM even begins — extracted data is incomplete, mis-ordered, or missing rows.
Requirement for finance:
A layout-aware parser engineered for financial documents with deterministic reconstruction of tables, footnotes, and headers.
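A toy illustration of the layout-aware idea, assuming a PDF engine has already produced word boxes with coordinates (the words and coordinates here are made up). A linear extractor emits words in stream order and shreds the table; grouping by y-coordinate reconstructs rows deterministically.

```python
# Hypothetical word boxes from a PDF layout engine: (text, x, y).
words = [
    ("Borrowings", 10, 100), ("FY24", 200, 100), ("FY25", 300, 101),
    ("NCDs", 10, 120), ("500", 200, 121), ("650", 300, 120),
    ("Bank loans", 10, 140), ("300", 200, 140), ("280", 300, 141),
]

def reconstruct_rows(words, y_tol=3):
    """Group words into table rows by vertical position, then sort each
    row left-to-right so columns line up, even with small y jitter."""
    rows = []
    for text, x, y in sorted(words, key=lambda w: (w[2], w[1])):
        if rows and abs(y - rows[-1][0]) <= y_tol:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    return [[t for _, t in sorted(cells)] for _, cells in rows]

for row in reconstruct_rows(words):
    print(row)  # header row, then one list per table row
```

Real parsers for scanned Indian filings need far more (column detection, merged cells, footnote markers), but the principle is the same: geometry first, text second.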
Global systems optimise for standardized filings (10-K/10-Q).
Indian filings vary massively in naming, formatting, templates, and disclosure density.
Many small- and mid-cap filings are scanned or image-heavy.
Consequence:
Critical data is missing, OCR fails, entities are misclassified, and the long tail becomes invisible.
Requirement for finance:
Direct exchange-level ingestion (BSE/NSE) with India-specific parsing rules and entity resolution.
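One small piece of that pipeline, entity resolution, can be sketched as name canonicalisation. The alias table and examples are illustrative: Indian filings name the same issuer many ways ("Ltd", "Ltd.", "LIMITED", stray whitespace), and resolution maps each variant to one canonical key before indexing.

```python
import re

# Hypothetical alias table; real resolution would also use ISIN/CIN codes.
ALIASES = {"ltd": "limited", "pvt": "private"}

def canonical_key(name: str) -> str:
    """Lowercase, strip punctuation and extra spaces, expand suffixes."""
    tokens = re.findall(r"[a-z]+", name.lower())
    return " ".join(ALIASES.get(t, t) for t in tokens)

def resolve(names):
    """Group raw filing names under their canonical entity key."""
    entities = {}
    for name in names:
        entities.setdefault(canonical_key(name), []).append(name)
    return entities

print(resolve(["Shriram Finance Ltd", "SHRIRAM FINANCE LIMITED",
               "Shriram  Finance Ltd."]))  # one entity, three variants
```

Without this step, the same company appears as three unrelated "entities" and the long tail of small- and mid-cap filings becomes invisible.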
The fundamental mistake is assuming the unit of work is a PDF.
In finance, the unit of work is the entity: a company and the full timeline of its filings, linked across documents and years.
A finance engine shouldn’t start reading when you ask a question.
It should have already parsed, indexed, structured, and linked the entire market before you log in.
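A minimal sketch of what "indexed before you log in" means structurally. The class and field names are assumptions; the contrast with "chat with PDF" is that the index is keyed by entity, not by file, and filings are linked chronologically at ingestion time, before any question is asked.

```python
from collections import defaultdict

class TimelineIndex:
    """Pre-built entity-level index: filings linked into a timeline
    at ingestion, so queries never start from raw uploads."""

    def __init__(self):
        self._timelines = defaultdict(list)

    def ingest(self, entity, period, doc_type, doc_id):
        self._timelines[entity].append((period, doc_type, doc_id))
        self._timelines[entity].sort()  # keep chronological order

    def timeline(self, entity, doc_type=None):
        return [f for f in self._timelines[entity]
                if doc_type is None or f[1] == doc_type]

idx = TimelineIndex()
idx.ingest("NBFC_A", "2024-Q2", "concall", "doc_101")
idx.ingest("NBFC_A", "2025-Q2", "concall", "doc_207")
idx.ingest("NBFC_A", "2025-Q2", "annual_report", "doc_208")

# The same quarter across years, already linked: no per-question uploads.
print(idx.timeline("NBFC_A", doc_type="concall"))
```

Cross-company questions then become a loop over entities in the index, rather than a fresh upload per prompt.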
General-purpose LLMs are extraordinary conversational systems.
But they were not built for exact retrieval, line-level citation, layout-aware parsing, or multi-year continuity across thousands of documents.
Financial research is built on exactly these constraints.
Until those architectural gaps close, chatbots will remain useful assistants but not research infrastructure.
CompoundingAI is an enterprise-grade vertical intelligence engine that transforms unstructured data into decision-grade insights within minutes, with source-level traceability for confident & auditable workflows across research, risk, and investment teams.
We cut the noise and deliver insights directly.
Powered by CompoundingAI — AI-powered Indian equity research