ai-ml 2026-06-13

Building StockApp: A Multi-Agent AI Equity Research Platform That Learned From Its Own Bias

Automating the complete stock analysis workflow — Excel parsing, market data, document intelligence, RAG Q&A, and daily AI recommendations — and what happened when I backtested the agent against itself.

DjangoReactGemini QdrantRAGMulti-Agent CeleryGCPTerraform

Contents

The Problem
High-Level Design & Architecture
Cloud Infrastructure
The Agents, In Detail
Multi-Agent Architecture
Why Prompts Are the Product
RAG Deep-Dive: Chunks, Dimensions, Retrieval
Daily Automation: Manual vs Automated
The Backtest That Humbled My Agent
Engineering Challenges
Pros & Cons
References

1. The Problem

Researching a single stock properly takes hours every day: downloading financials from Screener.in, checking price action against benchmarks, reading 30-page investor presentations, parsing concall transcripts for management tone, scanning news, and finally synthesizing all of it into a decision. Multiply that across a watchlist and it becomes a full-time job.

StockApp automates the entire loop. Upload a Screener Excel export once; the system handles everything else daily — including downloading new documents the moment they appear, analyzing them with LLM agents, and issuing a recommendation that knows your actual portfolio position and its own track record.

2. High-Level Design & Architecture

┌──────────────────────────────── GCP ─────────────────────────────────┐
│                                                                       │
│   Cloud Run (Frontend)           Cloud Run (Backend + Worker)         │
│   ┌───────────────┐   HTTPS     ┌─────────────────────────────────┐  │
│   │  React + Vite │────────────►│  Django + DRF  (JWT, opt. 2FA)  │  │
│   │  Tailwind     │             │  ┌───────────────────────────┐  │  │
│   │  Framer Motion│             │  │  Celery Workers (Agents)  │  │  │
│   └───────────────┘             │  │  · Excel Parser           │  │  │
│                                 │  │  · Price/Performance      │  │  │
│   Cloud Scheduler ──► daily ───►│  │  · IP Analysis Agent      │  │  │
│   (06:00 IST)        pipeline   │  │  · Concall Agent          │  │  │
│                                 │  │  · News Agent (grounded)  │  │  │
│                                 │  │  · Screener Doc Sync      │  │  │
│                                 │  │  · Master Reco Agent      │  │  │
│                                 │  │  · Evaluator / Backtester │  │  │
│                                 │  └───────────────────────────┘  │  │
│                                 └──────────────┬──────────────────┘  │
│                                                │ VPC Connector        │
│   ┌──────────────── Compute Engine VM ─────────▼──────────────────┐  │
│   │   MySQL 8            Redis              Qdrant                │  │
│   │   (facts)            (broker +          (semantic vectors)    │  │
│   │   metrics, prices,    session cache)    excel_financials,     │  │
│   │   recos, portfolio                      documents             │  │
│   └────────────────────────────────────────────────────────────────┘ │
│                                                                       │
│   GCS: raw Excel/PDF artifacts  ·  Secret Manager: keys & TOTP        │
└───────────────────────────────────────────────────────────────────────┘
  External: screener.in · 5paisa OpenAPI · Gemini API · Google Search

Figure 1 — System architecture: stateless compute on Cloud Run, stateful data plane on a GCE VM.

Core design principle

MySQL stores facts. Qdrant stores meaning. The LLM never computes. Exact numbers (prices, metrics, targets) live in relational tables for deterministic lookups. Semantic content lives in vector space for fuzzy retrieval. All math — returns, RSI, MACD, ratios — is computed in pandas; the LLM only interprets pre-computed values. This single rule eliminated an entire class of hallucinated-arithmetic bugs.

Technology Stack

Layer	Choice	Why
Frontend	React 18, Vite, Tailwind CSS 4, Framer Motion, Recharts	Fast iteration, dark fintech aesthetic, micro-animations
Backend	Django 5 + DRF + SimpleJWT	Mature ORM/auth; admin for free
Async	Celery 5 + Redis	PDF analysis runs 30–90s; can't block HTTP
Databases	MySQL 8 + Qdrant	Facts vs meaning separation
LLM	Gemini 2.5 Pro / Flash	Native PDF input, JSON mode, search grounding
Embeddings	gemini-embedding-001 (auto-detected)	Survived two model deprecations mid-project
Market data	5paisa OpenAPI	Stock + index candles, TOTP session
Infra	Terraform + Cloud Build + Cloud Scheduler	Env-separated IaC, daily cron

3. Cloud Infrastructure

The deployment splits cleanly along the stateless/stateful boundary:

Cloud Run hosts three services: the React frontend, the Django API (scale-to-zero), and the Celery worker (pinned min_instances=1 with internal-only ingress — workers must always be listening to the queue).
A single GCE VM runs MySQL, Redis, and Qdrant via docker-compose on an attached SSD persistent disk. Firewall rules allow ports 3306/6379/6333 only from the serverless VPC connector range — the data plane has no public surface.
Cloud Scheduler fires the daily pipeline at 06:00 IST via an OIDC-authenticated HTTP call; Cloud Run IAM ensures only the scheduler's service account can invoke it.
Secret Manager holds everything sensitive: Django secret, DB password, Gemini key, and the 5paisa credential set including the TOTP seed URI.
Terraform is organized as reusable modules (network, datavm, cloudrun, scheduler, storage, secrets) with separate dev/prod tfvars; Cloud Build runs plan on PR and apply on merge.

Honest trade-off

Co-locating MySQL/Redis/Qdrant on one VM is a cost decision, not a best practice. The mitigations are disk snapshots and a startup script that self-heals the compose stack; the upgrade path is Cloud SQL + Memorystore + Qdrant Cloud when scale demands it.

4. The Agents, In Detail

4.1 Excel Parser (deterministic "agent")

Not an LLM — and that's the point. Screener exports have a fixed template (P&L, Quarters, Balance Sheet, Cash Flow all formula-derived from a master Data Sheet), so a schema-aware state machine beats any generic approach. It walks Data Sheet sections (META → PROFIT & LOSS → Quarters → BALANCE SHEET → CASH FLOW → PRICE → DERIVED), resets period headers at each Report Date row, and emits ~130 typed metric records.

The bug that started it all

My first parser piped values through pd.to_datetime(errors="coerce"). Pandas interpreted every numeric as nanoseconds since epoch — so Sales of ₹14,600 Cr became 1970-01-01. Every single financial figure was silently destroyed. Lesson: never datetime-coerce columns that contain numbers.

4.2 Price & Performance Fetcher

Authenticates to 5paisa with TOTP (session cached in Redis, 6h TTL), pulls ~400 days of candles for the stock and every benchmark index, then computes YTD/1W/1M/3M/6M/9M returns and — the metric that matters — alpha (stock return minus index return) per benchmark, with sector auto-mapping (Auto → NIFTY AUTO, IT → NIFTY IT).

Silent data corruption, round two

My assumed index scrip codes were wrong in the most dangerous way: they returned data — just for the wrong index. 999920016 wasn't NIFTY AUTO, it was NIFTY PSE; 999920019 wasn't NIFTY PHARMA, it was INDIA VIX. Always verify against the scrip master; plausible-looking data is worse than an exception.

4.3 Investor Presentation Agent

Sends the raw PDF to Gemini (inline bytes ≤19MB; Files API above) with response_mime_type: application/json and a fixed extraction schema: about, revenue mix by product/channel/region, regional financials, highlights, growth plans, capex, government incentives, new products. Gemini reads the charts and tables inside the deck images — no OCR pipeline needed. Apollo Tyres' Q4 deck extracted in 19 seconds with revenue splits accurate to the percentage point.

4.4 Concall Transcript Agent

Same PDF mechanism, different schema: it separates management commentary from analyst Q&A, classifies tone (BULLISH → BEARISH), and extracts demand outlook, margin commentary, pricing actions, explicit guidance, risks, key Q&A pairs, and up to three verbatim quotes. The output of 16 quarters of these became the corpus for cross-quarter RAG questions.

4.5 News Agent (Search-Grounded)

Gemini 2.5 Flash with the google_search tool finds the last 7 days of company news and returns sentiment-tagged JSON. Two production gotchas: grounded responses wrap JSON in prose (solved with a tolerant first-[-to-last-] extractor), and grounding redirect URLs exceed 500 characters (solved by switching URLField → TextField).

4.6 Screener Document Sync

Scrapes the Documents section of the company's Screener page, classifies links by label (transcript/presentation/annual report), validates downloads by %PDF magic bytes, dedupes against a URL ledger, and auto-queues analysis. First run it discovered and processed Apollo's entire concall archive unattended — which also taught me to add per-run caps and 429 retry-with-backoff.

4.7 Master Recommendation Agent

The synthesis layer (Gemini 2.5 Pro). Inputs: fundamentals summary from MySQL, price and alpha data, pre-computed technicals, latest IP + concall insights, news, the user's actual position, and the agent's own evaluated track record. Output: a strict JSON contract — action, confidence, a mandatory bear case, short/long-term targets with stop-loss and timelines, four-pillar rationale, key risks, and portfolio-specific sizing advice.

4.8 Evaluator & Backtester

Every recommendation is scored daily against real candles: target hit? stop-loss hit? maximum favorable/adverse excursion? The backtester goes further — monthly point-in-time replays (candles truncated to the as-of date; only documents published before that date) scored against actual forward 1M/3M/6M returns, bucketed by action and confidence level.

5. Multi-Agent Architecture

This is a hub-and-spoke design, not a free-form agent swarm. Specialist agents run independently and persist structured outputs; the master agent consumes those artifacts — never raw inputs.

            ┌────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐
            │ Excel      │  │ Price/Perf │  │ IP Agent   │  │ Concall    │
            │ Parser     │  │ + Technicals│ │ (PDF→JSON) │  │ Agent      │
            └─────┬──────┘  └─────┬──────┘  └─────┬──────┘  └─────┬──────┘
                  │               │               │               │
              MySQL +         MySQL          MySQL + Qdrant   MySQL + Qdrant
              Qdrant            │               │               │
                  │               │               │               │
            ┌─────┴───────────────┴───────────────┴───────────────┴─────┐
            │                  STRUCTURED ARTIFACT LAYER                 │
            │     (typed rows + JSON insights + embedded chunks)         │
            └─────┬───────────────────────────────────────────────┬─────┘
                  │                                               │
        ┌─────────▼─────────┐   ┌──────────────┐        ┌─────────▼────────┐
        │  MASTER RECO      │◄──│ News Agent   │        │  RAG Chat        │
        │  AGENT            │   │ (grounded)   │        │  (analyst Q&A)   │
        │  + Confidence     │   └──────────────┘        └──────────────────┘
        │    Governor (code)│
        └─────────┬─────────┘
                  │
        ┌─────────▼─────────┐         ┌──────────────────┐
        │  Recommendation   │────────►│  EVALUATOR        │──┐
        │  (persisted)      │         │  scores vs actual │  │ feedback loop:
        └───────────────────┘         │  prices daily     │  │ track record
                  ▲                   └──────────────────┘  │ injected into
                  └─────────────────────────────────────────┘ next prompt

Figure 2 — Hub-and-spoke agents with a persistent artifact layer and a self-evaluation feedback loop.

Why this shape instead of agent-to-agent messaging?

Debuggability — every intermediate artifact is inspectable in MySQL/Qdrant. When the recommendation looks wrong, you can see exactly which input misled it.
Independent failure domains — a news-agent outage doesn't block the recommendation; it degrades gracefully to "News: None".
Cost control — expensive extractions (PDF analysis) run once and are reused by both the master agent and the chat, instead of re-reading PDFs per query.
The feedback loop is the differentiator — the evaluator's scorecard flows back into the master agent's context, making it the only agent in the system that reads its own report card.

6. Why Prompts Are the Product

The same model, same data, same architecture produced opposite recommendations depending purely on prompt design. Three prompt engineering decisions mattered more than any infrastructure choice:

① Fixed JSON schemas turn LLMs into reliable parsers

Every extraction agent receives an exact output contract ("Return STRICT JSON with EXACTLY this structure, use null/[] if not found") combined with response_mime_type: application/json. Across 60+ document analyses, schema adherence was effectively 100% — the unreliable part of LLMs isn't structure, it's unconstrained generation.

② Source skepticism must be explicit

The original master prompt treated concall insights as ground truth. Result: the agent absorbed management's optimism wholesale. The fix was adversarial framing baked into the prompt: "Management commentary is INHERENTLY PROMOTIONAL... Treat guidance as marketing until corroborated by reported numbers" — plus a mandatory bear_case field: if the model can't argue against its own call, it hasn't analyzed the stock.

③ Self-evaluation in-context changes behavior

The most surprising finding. Injecting the agent's own track record — "Your high-confidence calls averaged −5.3% vs −0.4% for low-confidence ones. You issued BUY 8/8 times including at the cycle top." — measurably recalibrated it. On identical live data, the recommendation shifted from ADD @ 75% confidence to HOLD @ 50% with an articulate value-trap bear case.

Defense in depth

Prompts are suggestions; code is law. A Python confidence governor enforces hard caps after generation (RSI > 65 → cap 55; price >10% above 50DMA → cap 55; counter-trend buys without momentum confirmation → cap 60) regardless of what the model claims.

7. RAG Deep-Dive: Chunks, Dimensions, Retrieval

The Flow

 INGESTION                                          RETRIEVAL
 ─────────                                          ─────────
 ┌─────────────┐  ┌──────────────┐                 ┌─────────────────┐
 │ Excel       │  │ IP / Concall │                 │ User question    │
 │ (Screener)  │  │ PDFs         │                 │ "How has margin  │
 └──────┬──────┘  └──────┬───────┘                 │  guidance        │
        │                │                          │  evolved?"       │
 schema-aware     Gemini structured                 └────────┬────────┘
 parser           extraction (JSON)                          │
        │                │                          embed (gemini-
        ▼                ▼                          embedding-001)
 ┌─────────────────────────────┐                            │
 │  SEMANTIC CHUNK BUILDER      │                            ▼
 │  · 1 chunk per metric        │              ┌──────────────────────────┐
 │  · 1 chunk per insight       │              │ Qdrant query_points       │
 │    section                   │              │ payload filter:           │
 │  · headers baked into        │              │   stock_id = N            │
 │    every chunk               │              │ + cosine similarity       │
 │  · auto YoY/QoQ stats        │              │ across BOTH collections   │
 └──────────┬──────────────────┘              └────────────┬─────────────┘
            │                                              │
   embed (gemini-embedding-001,                   top-k chunks (5 + 5)
   task_type=retrieval_document)                           │
            │                                              ▼
            ▼                                  ┌──────────────────────────┐
 ┌─────────────────────────────┐              │ Gemini 2.5 Flash          │
 │ QDRANT                       │              │ context-locked prompt:    │
 │ · excel_financials           │              │ "Answer ONLY from         │
 │ · documents                  │              │  context. Quote exact     │
 │ vectors + rich payload       │              │  numbers & periods.       │
 │ (stock_id, statement_type,   │              │  State units."            │
 │  fiscal_years, doc_type,     │              └────────────┬─────────────┘
 │  period, metric, raw_text)   │                           ▼
 └─────────────────────────────┘                    Grounded answer

Figure 3 — End-to-end RAG flow: semantic chunking on ingest, hybrid filtered retrieval, context-locked generation.

Chunking Strategy: Semantic Units, Not Token Windows

Standard RAG advice says "512-token chunks with 50-token overlap." For financial data this is exactly wrong: a token window slices a table row away from its column headers, producing orphaned numbers no embedding can rescue. StockApp chunks by semantic unit instead:

Financial chunks — one per line item, carrying the company name, statement type, unit declaration, every period:value pair, plus auto-computed latest YoY/QoQ change and min/max across periods.
Document chunks — one per insight section (demand outlook, margin commentary, capex…), prefixed with company / document type / period.

Company: Apollo Tyres Ltd (NSE/BSE listed)
Financial Statement: Income Statement
Metric: Sales
Unit: INR Crores
Values:
  Mar-2017: 13180.04
  Mar-2018: 14842.85
  ...
  Mar-2025: 26123.42
Latest YoY change (Mar-2024 → Mar-2025): +2.9%
Highest: 26123.42 | Lowest: 13180.04 across 9 periods

Why meaning survives

Chunk overlap: zero — by design. Overlap exists to repair context broken by arbitrary splitting; if chunks are born self-contained, there is nothing to repair. Every number lives next to its label, period, and unit, so the question "sales in FY2023" and its answer occupy the same embedding neighborhood. Relationships across chunks (e.g., margin trends across quarters) are reconstructed at retrieval time by the LLM from multiple complete chunks — which is what LLMs are good at.

Embedding Model & Dimensions

Aspect	Implementation
Model	`gemini-embedding-001` — selected at runtime from a candidate list, because `text-embedding-004` was deprecated mid-project
Dimensions	Discovered dynamically from the first embedding batch; the Qdrant collection is created (or rebuilt) to match — model swaps don't break the store
Distance	Cosine
Task types	`retrieval_document` at ingest, `retrieval_query` at question time (asymmetric optimization)
Batching	100 texts per API call

Retrieval Mechanism: Hybrid, Filtered, Multi-Collection

Pure vector search fails financial queries in predictable ways — "Q3 sales" matches annual sales chunks; one stock's metrics bleed into another's. The fix is filter-then-rank:

Payload pre-filter — Qdrant indexed fields (stock_id, statement_type, fiscal_years, doc_type) constrain the candidate set before similarity ranking. Multi-stock isolation is a hard filter, not a hope.
Cosine top-k across both collections (financials + documents), 5 hits each, merged into ≤10 context chunks.
Context-locked generation — the answer prompt forbids outside knowledge, requires exact figures with period labels, and forces unit disambiguation (Excel = ₹ Crores; presentations often ₹ Mn).

The payoff: asked "How has margin guidance evolved over the last few quarters?", the system pulled chunks spanning Q1 FY25 through Q4 FY26 concalls and synthesized the guidance arc — long-term >15% EBITDA target, raw-material headwinds, plant-closure tailwinds — with every claim traceable to a retrieved chunk.

8. Daily Automation: Eliminating the Manual Grind

One Cloud Scheduler trigger at 06:00 IST runs the full pipeline per stock: prices → performance → news → document sync → recommendation evaluation → fresh recommendation. Here's what that replaces:

Task	Manual Workflow	StockApp Automated	Time Saved/Day
Price & benchmark check	Open broker app, check stock, open NSE site, compare 4 indices, compute alpha by hand	5paisa API + computed alpha across all benchmarks	~15 min
Financials review	Re-download Screener Excel, eyeball trends	Parsed once; trends & YoY pre-computed; chatbot for any lookup	~20 min
New document detection	Check Screener/BSE manually for new filings	Scraper detects, downloads, validates, dedupes	~10 min
Presentation analysis	Read 30–40 slides, take notes	Structured extraction in ~20s: revenue mix, capex, guidance	~45 min
Concall analysis	Read/listen 60–90 min, separate signal from PR	Tone, guidance, risks, key Q&A extracted; PR-skepticism built into downstream prompt	~75 min
News scan	Google, filter noise, judge sentiment	Search-grounded agent, sentiment-tagged, deduped	~15 min
Synthesis & decision	Hold it all in your head; vulnerable to recency & confirmation bias	Master agent with mandatory bear case, calibrated confidence, portfolio-aware sizing	~30 min
Track-record review	Almost nobody does this honestly	Automatic daily scoring vs actual prices; fed back into the agent	priceless
Total (per stock)	~3.5 hours	~4 minutes of compute	~98%

And the automated version scales linearly with watchlist size, never skips a step on a busy day, and never "forgets" to log whether yesterday's call worked out.

9. The Backtest That Humbled My Agent

Before trusting the recommendations, I built a harness that replays the agent at historical month-ends with strictly point-in-time data, then scores each call against real forward returns. The verdict on 8 monthly replays:

Agent avg forward 3M:      −3.43%
Buy-and-hold avg 3M:       −3.43%   ← identical: ZERO selection edge
Action distribution:        BUY 8/8  ← permabull
High-confidence (≥70) avg:  −5.28%, 40% target-hit
Low-confidence  (<70) avg:  −0.36%, 67% target-hit  ← confidence INVERTED
Highest-confidence call:    80% conf, at the exact cycle top, −14.7% fwd 3M

Diagnosis

The agent was a momentum-following bull with inversely-calibrated confidence. Cause: concall transcripts are maximally optimistic at tops, and the prompt treated them as truth. The only thing saving the strategy was its stop-losses.

The fix — and the measured result

Three layers: adversarial prompt framing + mandatory bear case, the agent's own scorecard injected into every prompt, and a code-level confidence governor. On identical live data the recommendation moved from ADD @ 75% to HOLD @ 50% with this bear case: "…the stock is a value trap in a persistent downtrend. Any technical bounce will be short-lived… as margin compression continues." An agent that knows its own failure modes behaves differently.

10. Engineering Challenges

Challenge	Root Cause	Resolution
All financials became `1970-01-01`	pandas datetime coercion of numerics	Type-strict cleaning; only true datetime cells converted
Two Gemini models deprecated mid-build	Provider model churn	Runtime model auto-detection with candidate fallback lists
Benchmark data was the wrong index	Assumed scrip codes; NIFTY AUTO ↔ NIFTY PSE swap	Scrip-master verification + empty-frame guards
News items silently dropped	Grounding redirect URLs > 200-char URLField	TextField + truncation; per-item exception isolation
One bad import killed the whole API	Module-level imports of optional services	Function-level imports for non-critical paths
60+ document flood on first sync	Unbounded archive scrape	Per-run cap, coverage cap, 429 retry with backoff
Agent permabull bias	Promotional source material taken at face value	Prompt discipline + scorecard feedback + code governor

11. Pros & Cons

Pros

~98% reduction in daily research time, scaling linearly across a watchlist
Deterministic math + LLM interpretation = no hallucinated numbers
16 quarters of document memory queryable in natural language
Built-in honesty: every call scored against reality, biases measured not assumed
Self-correcting loop — the agent reads its own report card
Hybrid retrieval prevents cross-stock contamination by construction
Modular agents fail independently; pipeline degrades gracefully

Cons & Limitations

n=8 backtest on one stock is evidence of process, not statistical edge
Mild lookahead leakage: document insights generated by today's extraction prompts
LLM API costs scale with document volume; model deprecations require maintenance
Screener scraping is brittle vs bot protection; exchange APIs are the robust path
Single-VM data plane is a cost compromise (snapshot-mitigated)
Recommendations are decision support, not financial advice — the human stays in the loop

12. References

Qdrant — Filtering & Hybrid Search: qdrant.tech/documentation/concepts/filtering
Gemini API — Document (PDF) Understanding: ai.google.dev/gemini-api/docs/document-processing
Gemini API — Grounding with Google Search: ai.google.dev/gemini-api/docs/grounding
Gemini API — Embeddings & task types: ai.google.dev/gemini-api/docs/embeddings
5paisa OpenAPI documentation: 5paisa.com/developerapi
Celery — Tasks, retries & backoff: docs.celeryq.dev
Django REST Framework: django-rest-framework.org
Cloud Run + Cloud Scheduler (OIDC invocation): cloud.google.com/run/docs/triggering/using-scheduler
Terraform Google Provider: registry.terraform.io
Screener.in — financial data source: screener.in