Building StockApp: A Multi-Agent AI Equity Research Platform That Learned From Its Own Bias

Automating the complete stock analysis workflow — Excel parsing, market data, document intelligence, RAG Q&A, and daily AI recommendations — and what happened when I backtested the agent against itself.

DjangoReactGemini QdrantRAGMulti-Agent CeleryGCPTerraform

1. The Problem

Researching a single stock properly takes hours every day: downloading financials from Screener.in, checking price action against benchmarks, reading 30-page investor presentations, parsing concall transcripts for management tone, scanning news, and finally synthesizing all of it into a decision. Multiply that across a watchlist and it becomes a full-time job.

StockApp automates the entire loop. Upload a Screener Excel export once; the system handles everything else daily — including downloading new documents the moment they appear, analyzing them with LLM agents, and issuing a recommendation that knows your actual portfolio position and its own track record.

2. High-Level Design & Architecture

┌──────────────────────────────── GCP ─────────────────────────────────┐
│                                                                       │
│   Cloud Run (Frontend)           Cloud Run (Backend + Worker)         │
│   ┌───────────────┐   HTTPS     ┌─────────────────────────────────┐  │
│   │  React + Vite │────────────►│  Django + DRF  (JWT, opt. 2FA)  │  │
│   │  Tailwind     │             │  ┌───────────────────────────┐  │  │
│   │  Framer Motion│             │  │  Celery Workers (Agents)  │  │  │
│   └───────────────┘             │  │  · Excel Parser           │  │  │
│                                 │  │  · Price/Performance      │  │  │
│   Cloud Scheduler ──► daily ───►│  │  · IP Analysis Agent      │  │  │
│   (06:00 IST)        pipeline   │  │  · Concall Agent          │  │  │
│                                 │  │  · News Agent (grounded)  │  │  │
│                                 │  │  · Screener Doc Sync      │  │  │
│                                 │  │  · Master Reco Agent      │  │  │
│                                 │  │  · Evaluator / Backtester │  │  │
│                                 │  └───────────────────────────┘  │  │
│                                 └──────────────┬──────────────────┘  │
│                                                │ VPC Connector        │
│   ┌──────────────── Compute Engine VM ─────────▼──────────────────┐  │
│   │   MySQL 8            Redis              Qdrant                │  │
│   │   (facts)            (broker +          (semantic vectors)    │  │
│   │   metrics, prices,    session cache)    excel_financials,     │  │
│   │   recos, portfolio                      documents             │  │
│   └────────────────────────────────────────────────────────────────┘ │
│                                                                       │
│   GCS: raw Excel/PDF artifacts  ·  Secret Manager: keys & TOTP        │
└───────────────────────────────────────────────────────────────────────┘
  External: screener.in · 5paisa OpenAPI · Gemini API · Google Search
Figure 1 — System architecture: stateless compute on Cloud Run, stateful data plane on a GCE VM.
Core design principle
MySQL stores facts. Qdrant stores meaning. The LLM never computes. Exact numbers (prices, metrics, targets) live in relational tables for deterministic lookups. Semantic content lives in vector space for fuzzy retrieval. All math — returns, RSI, MACD, ratios — is computed in pandas; the LLM only interprets pre-computed values. This single rule eliminated an entire class of hallucinated-arithmetic bugs.

Technology Stack

LayerChoiceWhy
FrontendReact 18, Vite, Tailwind CSS 4, Framer Motion, RechartsFast iteration, dark fintech aesthetic, micro-animations
BackendDjango 5 + DRF + SimpleJWTMature ORM/auth; admin for free
AsyncCelery 5 + RedisPDF analysis runs 30–90s; can't block HTTP
DatabasesMySQL 8 + QdrantFacts vs meaning separation
LLMGemini 2.5 Pro / FlashNative PDF input, JSON mode, search grounding
Embeddingsgemini-embedding-001 (auto-detected)Survived two model deprecations mid-project
Market data5paisa OpenAPIStock + index candles, TOTP session
InfraTerraform + Cloud Build + Cloud SchedulerEnv-separated IaC, daily cron

3. Cloud Infrastructure

The deployment splits cleanly along the stateless/stateful boundary:

  • Cloud Run hosts three services: the React frontend, the Django API (scale-to-zero), and the Celery worker (pinned min_instances=1 with internal-only ingress — workers must always be listening to the queue).
  • A single GCE VM runs MySQL, Redis, and Qdrant via docker-compose on an attached SSD persistent disk. Firewall rules allow ports 3306/6379/6333 only from the serverless VPC connector range — the data plane has no public surface.
  • Cloud Scheduler fires the daily pipeline at 06:00 IST via an OIDC-authenticated HTTP call; Cloud Run IAM ensures only the scheduler's service account can invoke it.
  • Secret Manager holds everything sensitive: Django secret, DB password, Gemini key, and the 5paisa credential set including the TOTP seed URI.
  • Terraform is organized as reusable modules (network, datavm, cloudrun, scheduler, storage, secrets) with separate dev/prod tfvars; Cloud Build runs plan on PR and apply on merge.
Honest trade-off
Co-locating MySQL/Redis/Qdrant on one VM is a cost decision, not a best practice. The mitigations are disk snapshots and a startup script that self-heals the compose stack; the upgrade path is Cloud SQL + Memorystore + Qdrant Cloud when scale demands it.

4. The Agents, In Detail

4.1 Excel Parser (deterministic "agent")

Not an LLM — and that's the point. Screener exports have a fixed template (P&L, Quarters, Balance Sheet, Cash Flow all formula-derived from a master Data Sheet), so a schema-aware state machine beats any generic approach. It walks Data Sheet sections (META → PROFIT & LOSS → Quarters → BALANCE SHEET → CASH FLOW → PRICE → DERIVED), resets period headers at each Report Date row, and emits ~130 typed metric records.

The bug that started it all
My first parser piped values through pd.to_datetime(errors="coerce"). Pandas interpreted every numeric as nanoseconds since epoch — so Sales of ₹14,600 Cr became 1970-01-01. Every single financial figure was silently destroyed. Lesson: never datetime-coerce columns that contain numbers.

4.2 Price & Performance Fetcher

Authenticates to 5paisa with TOTP (session cached in Redis, 6h TTL), pulls ~400 days of candles for the stock and every benchmark index, then computes YTD/1W/1M/3M/6M/9M returns and — the metric that matters — alpha (stock return minus index return) per benchmark, with sector auto-mapping (Auto → NIFTY AUTO, IT → NIFTY IT).

Silent data corruption, round two
My assumed index scrip codes were wrong in the most dangerous way: they returned data — just for the wrong index. 999920016 wasn't NIFTY AUTO, it was NIFTY PSE; 999920019 wasn't NIFTY PHARMA, it was INDIA VIX. Always verify against the scrip master; plausible-looking data is worse than an exception.

4.3 Investor Presentation Agent

Sends the raw PDF to Gemini (inline bytes ≤19MB; Files API above) with response_mime_type: application/json and a fixed extraction schema: about, revenue mix by product/channel/region, regional financials, highlights, growth plans, capex, government incentives, new products. Gemini reads the charts and tables inside the deck images — no OCR pipeline needed. Apollo Tyres' Q4 deck extracted in 19 seconds with revenue splits accurate to the percentage point.

4.4 Concall Transcript Agent

Same PDF mechanism, different schema: it separates management commentary from analyst Q&A, classifies tone (BULLISH → BEARISH), and extracts demand outlook, margin commentary, pricing actions, explicit guidance, risks, key Q&A pairs, and up to three verbatim quotes. The output of 16 quarters of these became the corpus for cross-quarter RAG questions.

4.5 News Agent (Search-Grounded)

Gemini 2.5 Flash with the google_search tool finds the last 7 days of company news and returns sentiment-tagged JSON. Two production gotchas: grounded responses wrap JSON in prose (solved with a tolerant first-[-to-last-] extractor), and grounding redirect URLs exceed 500 characters (solved by switching URLFieldTextField).

4.6 Screener Document Sync

Scrapes the Documents section of the company's Screener page, classifies links by label (transcript/presentation/annual report), validates downloads by %PDF magic bytes, dedupes against a URL ledger, and auto-queues analysis. First run it discovered and processed Apollo's entire concall archive unattended — which also taught me to add per-run caps and 429 retry-with-backoff.

4.7 Master Recommendation Agent

The synthesis layer (Gemini 2.5 Pro). Inputs: fundamentals summary from MySQL, price and alpha data, pre-computed technicals, latest IP + concall insights, news, the user's actual position, and the agent's own evaluated track record. Output: a strict JSON contract — action, confidence, a mandatory bear case, short/long-term targets with stop-loss and timelines, four-pillar rationale, key risks, and portfolio-specific sizing advice.

4.8 Evaluator & Backtester

Every recommendation is scored daily against real candles: target hit? stop-loss hit? maximum favorable/adverse excursion? The backtester goes further — monthly point-in-time replays (candles truncated to the as-of date; only documents published before that date) scored against actual forward 1M/3M/6M returns, bucketed by action and confidence level.

5. Multi-Agent Architecture

This is a hub-and-spoke design, not a free-form agent swarm. Specialist agents run independently and persist structured outputs; the master agent consumes those artifacts — never raw inputs.

            ┌────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐
            │ Excel      │  │ Price/Perf │  │ IP Agent   │  │ Concall    │
            │ Parser     │  │ + Technicals│ │ (PDF→JSON) │  │ Agent      │
            └─────┬──────┘  └─────┬──────┘  └─────┬──────┘  └─────┬──────┘
                  │               │               │               │
              MySQL +         MySQL          MySQL + Qdrant   MySQL + Qdrant
              Qdrant            │               │               │
                  │               │               │               │
            ┌─────┴───────────────┴───────────────┴───────────────┴─────┐
            │                  STRUCTURED ARTIFACT LAYER                 │
            │     (typed rows + JSON insights + embedded chunks)         │
            └─────┬───────────────────────────────────────────────┬─────┘
                  │                                               │
        ┌─────────▼─────────┐   ┌──────────────┐        ┌─────────▼────────┐
        │  MASTER RECO      │◄──│ News Agent   │        │  RAG Chat        │
        │  AGENT            │   │ (grounded)   │        │  (analyst Q&A)   │
        │  + Confidence     │   └──────────────┘        └──────────────────┘
        │    Governor (code)│
        └─────────┬─────────┘
                  │
        ┌─────────▼─────────┐         ┌──────────────────┐
        │  Recommendation   │────────►│  EVALUATOR        │──┐
        │  (persisted)      │         │  scores vs actual │  │ feedback loop:
        └───────────────────┘         │  prices daily     │  │ track record
                  ▲                   └──────────────────┘  │ injected into
                  └─────────────────────────────────────────┘ next prompt
Figure 2 — Hub-and-spoke agents with a persistent artifact layer and a self-evaluation feedback loop.

Why this shape instead of agent-to-agent messaging?

  • Debuggability — every intermediate artifact is inspectable in MySQL/Qdrant. When the recommendation looks wrong, you can see exactly which input misled it.
  • Independent failure domains — a news-agent outage doesn't block the recommendation; it degrades gracefully to "News: None".
  • Cost control — expensive extractions (PDF analysis) run once and are reused by both the master agent and the chat, instead of re-reading PDFs per query.
  • The feedback loop is the differentiator — the evaluator's scorecard flows back into the master agent's context, making it the only agent in the system that reads its own report card.

6. Why Prompts Are the Product

The same model, same data, same architecture produced opposite recommendations depending purely on prompt design. Three prompt engineering decisions mattered more than any infrastructure choice:

① Fixed JSON schemas turn LLMs into reliable parsers

Every extraction agent receives an exact output contract ("Return STRICT JSON with EXACTLY this structure, use null/[] if not found") combined with response_mime_type: application/json. Across 60+ document analyses, schema adherence was effectively 100% — the unreliable part of LLMs isn't structure, it's unconstrained generation.

② Source skepticism must be explicit

The original master prompt treated concall insights as ground truth. Result: the agent absorbed management's optimism wholesale. The fix was adversarial framing baked into the prompt: "Management commentary is INHERENTLY PROMOTIONAL... Treat guidance as marketing until corroborated by reported numbers" — plus a mandatory bear_case field: if the model can't argue against its own call, it hasn't analyzed the stock.

③ Self-evaluation in-context changes behavior

The most surprising finding. Injecting the agent's own track record — "Your high-confidence calls averaged −5.3% vs −0.4% for low-confidence ones. You issued BUY 8/8 times including at the cycle top." — measurably recalibrated it. On identical live data, the recommendation shifted from ADD @ 75% confidence to HOLD @ 50% with an articulate value-trap bear case.

Defense in depth
Prompts are suggestions; code is law. A Python confidence governor enforces hard caps after generation (RSI > 65 → cap 55; price >10% above 50DMA → cap 55; counter-trend buys without momentum confirmation → cap 60) regardless of what the model claims.

7. RAG Deep-Dive: Chunks, Dimensions, Retrieval

The Flow

 INGESTION                                          RETRIEVAL
 ─────────                                          ─────────
 ┌─────────────┐  ┌──────────────┐                 ┌─────────────────┐
 │ Excel       │  │ IP / Concall │                 │ User question    │
 │ (Screener)  │  │ PDFs         │                 │ "How has margin  │
 └──────┬──────┘  └──────┬───────┘                 │  guidance        │
        │                │                          │  evolved?"       │
 schema-aware     Gemini structured                 └────────┬────────┘
 parser           extraction (JSON)                          │
        │                │                          embed (gemini-
        ▼                ▼                          embedding-001)
 ┌─────────────────────────────┐                            │
 │  SEMANTIC CHUNK BUILDER      │                            ▼
 │  · 1 chunk per metric        │              ┌──────────────────────────┐
 │  · 1 chunk per insight       │              │ Qdrant query_points       │
 │    section                   │              │ payload filter:           │
 │  · headers baked into        │              │   stock_id = N            │
 │    every chunk               │              │ + cosine similarity       │
 │  · auto YoY/QoQ stats        │              │ across BOTH collections   │
 └──────────┬──────────────────┘              └────────────┬─────────────┘
            │                                              │
   embed (gemini-embedding-001,                   top-k chunks (5 + 5)
   task_type=retrieval_document)                           │
            │                                              ▼
            ▼                                  ┌──────────────────────────┐
 ┌─────────────────────────────┐              │ Gemini 2.5 Flash          │
 │ QDRANT                       │              │ context-locked prompt:    │
 │ · excel_financials           │              │ "Answer ONLY from         │
 │ · documents                  │              │  context. Quote exact     │
 │ vectors + rich payload       │              │  numbers & periods.       │
 │ (stock_id, statement_type,   │              │  State units."            │
 │  fiscal_years, doc_type,     │              └────────────┬─────────────┘
 │  period, metric, raw_text)   │                           ▼
 └─────────────────────────────┘                    Grounded answer
Figure 3 — End-to-end RAG flow: semantic chunking on ingest, hybrid filtered retrieval, context-locked generation.

Chunking Strategy: Semantic Units, Not Token Windows

Standard RAG advice says "512-token chunks with 50-token overlap." For financial data this is exactly wrong: a token window slices a table row away from its column headers, producing orphaned numbers no embedding can rescue. StockApp chunks by semantic unit instead:

  • Financial chunks — one per line item, carrying the company name, statement type, unit declaration, every period:value pair, plus auto-computed latest YoY/QoQ change and min/max across periods.
  • Document chunks — one per insight section (demand outlook, margin commentary, capex…), prefixed with company / document type / period.
Company: Apollo Tyres Ltd (NSE/BSE listed)
Financial Statement: Income Statement
Metric: Sales
Unit: INR Crores
Values:
  Mar-2017: 13180.04
  Mar-2018: 14842.85
  ...
  Mar-2025: 26123.42
Latest YoY change (Mar-2024 → Mar-2025): +2.9%
Highest: 26123.42 | Lowest: 13180.04 across 9 periods
Why meaning survives
Chunk overlap: zero — by design. Overlap exists to repair context broken by arbitrary splitting; if chunks are born self-contained, there is nothing to repair. Every number lives next to its label, period, and unit, so the question "sales in FY2023" and its answer occupy the same embedding neighborhood. Relationships across chunks (e.g., margin trends across quarters) are reconstructed at retrieval time by the LLM from multiple complete chunks — which is what LLMs are good at.

Embedding Model & Dimensions

AspectImplementation
Modelgemini-embedding-001 — selected at runtime from a candidate list, because text-embedding-004 was deprecated mid-project
DimensionsDiscovered dynamically from the first embedding batch; the Qdrant collection is created (or rebuilt) to match — model swaps don't break the store
DistanceCosine
Task typesretrieval_document at ingest, retrieval_query at question time (asymmetric optimization)
Batching100 texts per API call

Retrieval Mechanism: Hybrid, Filtered, Multi-Collection

Pure vector search fails financial queries in predictable ways — "Q3 sales" matches annual sales chunks; one stock's metrics bleed into another's. The fix is filter-then-rank:

  1. Payload pre-filter — Qdrant indexed fields (stock_id, statement_type, fiscal_years, doc_type) constrain the candidate set before similarity ranking. Multi-stock isolation is a hard filter, not a hope.
  2. Cosine top-k across both collections (financials + documents), 5 hits each, merged into ≤10 context chunks.
  3. Context-locked generation — the answer prompt forbids outside knowledge, requires exact figures with period labels, and forces unit disambiguation (Excel = ₹ Crores; presentations often ₹ Mn).

The payoff: asked "How has margin guidance evolved over the last few quarters?", the system pulled chunks spanning Q1 FY25 through Q4 FY26 concalls and synthesized the guidance arc — long-term >15% EBITDA target, raw-material headwinds, plant-closure tailwinds — with every claim traceable to a retrieved chunk.

8. Daily Automation: Eliminating the Manual Grind

One Cloud Scheduler trigger at 06:00 IST runs the full pipeline per stock: prices → performance → news → document sync → recommendation evaluation → fresh recommendation. Here's what that replaces:

TaskManual WorkflowStockApp AutomatedTime Saved/Day
Price & benchmark checkOpen broker app, check stock, open NSE site, compare 4 indices, compute alpha by hand5paisa API + computed alpha across all benchmarks~15 min
Financials reviewRe-download Screener Excel, eyeball trendsParsed once; trends & YoY pre-computed; chatbot for any lookup~20 min
New document detectionCheck Screener/BSE manually for new filingsScraper detects, downloads, validates, dedupes~10 min
Presentation analysisRead 30–40 slides, take notesStructured extraction in ~20s: revenue mix, capex, guidance~45 min
Concall analysisRead/listen 60–90 min, separate signal from PRTone, guidance, risks, key Q&A extracted; PR-skepticism built into downstream prompt~75 min
News scanGoogle, filter noise, judge sentimentSearch-grounded agent, sentiment-tagged, deduped~15 min
Synthesis & decisionHold it all in your head; vulnerable to recency & confirmation biasMaster agent with mandatory bear case, calibrated confidence, portfolio-aware sizing~30 min
Track-record reviewAlmost nobody does this honestlyAutomatic daily scoring vs actual prices; fed back into the agentpriceless
Total (per stock)~3.5 hours~4 minutes of compute~98%

And the automated version scales linearly with watchlist size, never skips a step on a busy day, and never "forgets" to log whether yesterday's call worked out.

9. The Backtest That Humbled My Agent

Before trusting the recommendations, I built a harness that replays the agent at historical month-ends with strictly point-in-time data, then scores each call against real forward returns. The verdict on 8 monthly replays:

Agent avg forward 3M:      −3.43%
Buy-and-hold avg 3M:       −3.43%   ← identical: ZERO selection edge
Action distribution:        BUY 8/8  ← permabull
High-confidence (≥70) avg:  −5.28%, 40% target-hit
Low-confidence  (<70) avg:  −0.36%, 67% target-hit  ← confidence INVERTED
Highest-confidence call:    80% conf, at the exact cycle top, −14.7% fwd 3M
Diagnosis
The agent was a momentum-following bull with inversely-calibrated confidence. Cause: concall transcripts are maximally optimistic at tops, and the prompt treated them as truth. The only thing saving the strategy was its stop-losses.
The fix — and the measured result
Three layers: adversarial prompt framing + mandatory bear case, the agent's own scorecard injected into every prompt, and a code-level confidence governor. On identical live data the recommendation moved from ADD @ 75% to HOLD @ 50% with this bear case: "…the stock is a value trap in a persistent downtrend. Any technical bounce will be short-lived… as margin compression continues." An agent that knows its own failure modes behaves differently.

10. Engineering Challenges

ChallengeRoot CauseResolution
All financials became 1970-01-01pandas datetime coercion of numericsType-strict cleaning; only true datetime cells converted
Two Gemini models deprecated mid-buildProvider model churnRuntime model auto-detection with candidate fallback lists
Benchmark data was the wrong indexAssumed scrip codes; NIFTY AUTO ↔ NIFTY PSE swapScrip-master verification + empty-frame guards
News items silently droppedGrounding redirect URLs > 200-char URLFieldTextField + truncation; per-item exception isolation
One bad import killed the whole APIModule-level imports of optional servicesFunction-level imports for non-critical paths
60+ document flood on first syncUnbounded archive scrapePer-run cap, coverage cap, 429 retry with backoff
Agent permabull biasPromotional source material taken at face valuePrompt discipline + scorecard feedback + code governor

11. Pros & Cons

Pros

  • ~98% reduction in daily research time, scaling linearly across a watchlist
  • Deterministic math + LLM interpretation = no hallucinated numbers
  • 16 quarters of document memory queryable in natural language
  • Built-in honesty: every call scored against reality, biases measured not assumed
  • Self-correcting loop — the agent reads its own report card
  • Hybrid retrieval prevents cross-stock contamination by construction
  • Modular agents fail independently; pipeline degrades gracefully

Cons & Limitations

  • n=8 backtest on one stock is evidence of process, not statistical edge
  • Mild lookahead leakage: document insights generated by today's extraction prompts
  • LLM API costs scale with document volume; model deprecations require maintenance
  • Screener scraping is brittle vs bot protection; exchange APIs are the robust path
  • Single-VM data plane is a cost compromise (snapshot-mitigated)
  • Recommendations are decision support, not financial advice — the human stays in the loop

12. References

  1. Qdrant — Filtering & Hybrid Search: qdrant.tech/documentation/concepts/filtering
  2. Gemini API — Document (PDF) Understanding: ai.google.dev/gemini-api/docs/document-processing
  3. Gemini API — Grounding with Google Search: ai.google.dev/gemini-api/docs/grounding
  4. Gemini API — Embeddings & task types: ai.google.dev/gemini-api/docs/embeddings
  5. 5paisa OpenAPI documentation: 5paisa.com/developerapi
  6. Celery — Tasks, retries & backoff: docs.celeryq.dev
  7. Django REST Framework: django-rest-framework.org
  8. Cloud Run + Cloud Scheduler (OIDC invocation): cloud.google.com/run/docs/triggering/using-scheduler
  9. Terraform Google Provider: registry.terraform.io
  10. Screener.in — financial data source: screener.in