Building StockApp: A Multi-Agent AI Equity Research Platform That Learned From Its Own Bias
Automating the complete stock analysis workflow — Excel parsing, market data, document intelligence, RAG Q&A, and daily AI recommendations — and what happened when I backtested the agent against itself.
DjangoReactGemini QdrantRAGMulti-Agent CeleryGCPTerraform
- The Problem
- High-Level Design & Architecture
- Cloud Infrastructure
- The Agents, In Detail
- Multi-Agent Architecture
- Why Prompts Are the Product
- RAG Deep-Dive: Chunks, Dimensions, Retrieval
- Daily Automation: Manual vs Automated
- The Backtest That Humbled My Agent
- Engineering Challenges
- Pros & Cons
- References
1. The Problem
Researching a single stock properly takes hours every day: downloading financials from Screener.in, checking price action against benchmarks, reading 30-page investor presentations, parsing concall transcripts for management tone, scanning news, and finally synthesizing all of it into a decision. Multiply that across a watchlist and it becomes a full-time job.
StockApp automates the entire loop. Upload a Screener Excel export once; the system handles everything else daily — including downloading new documents the moment they appear, analyzing them with LLM agents, and issuing a recommendation that knows your actual portfolio position and its own track record.
2. High-Level Design & Architecture
┌──────────────────────────────── GCP ─────────────────────────────────┐ │ │ │ Cloud Run (Frontend) Cloud Run (Backend + Worker) │ │ ┌───────────────┐ HTTPS ┌─────────────────────────────────┐ │ │ │ React + Vite │────────────►│ Django + DRF (JWT, opt. 2FA) │ │ │ │ Tailwind │ │ ┌───────────────────────────┐ │ │ │ │ Framer Motion│ │ │ Celery Workers (Agents) │ │ │ │ └───────────────┘ │ │ · Excel Parser │ │ │ │ │ │ · Price/Performance │ │ │ │ Cloud Scheduler ──► daily ───►│ │ · IP Analysis Agent │ │ │ │ (06:00 IST) pipeline │ │ · Concall Agent │ │ │ │ │ │ · News Agent (grounded) │ │ │ │ │ │ · Screener Doc Sync │ │ │ │ │ │ · Master Reco Agent │ │ │ │ │ │ · Evaluator / Backtester │ │ │ │ │ └───────────────────────────┘ │ │ │ └──────────────┬──────────────────┘ │ │ │ VPC Connector │ │ ┌──────────────── Compute Engine VM ─────────▼──────────────────┐ │ │ │ MySQL 8 Redis Qdrant │ │ │ │ (facts) (broker + (semantic vectors) │ │ │ │ metrics, prices, session cache) excel_financials, │ │ │ │ recos, portfolio documents │ │ │ └────────────────────────────────────────────────────────────────┘ │ │ │ │ GCS: raw Excel/PDF artifacts · Secret Manager: keys & TOTP │ └───────────────────────────────────────────────────────────────────────┘ External: screener.in · 5paisa OpenAPI · Gemini API · Google Search
Technology Stack
| Layer | Choice | Why |
|---|---|---|
| Frontend | React 18, Vite, Tailwind CSS 4, Framer Motion, Recharts | Fast iteration, dark fintech aesthetic, micro-animations |
| Backend | Django 5 + DRF + SimpleJWT | Mature ORM/auth; admin for free |
| Async | Celery 5 + Redis | PDF analysis runs 30–90s; can't block HTTP |
| Databases | MySQL 8 + Qdrant | Facts vs meaning separation |
| LLM | Gemini 2.5 Pro / Flash | Native PDF input, JSON mode, search grounding |
| Embeddings | gemini-embedding-001 (auto-detected) | Survived two model deprecations mid-project |
| Market data | 5paisa OpenAPI | Stock + index candles, TOTP session |
| Infra | Terraform + Cloud Build + Cloud Scheduler | Env-separated IaC, daily cron |
3. Cloud Infrastructure
The deployment splits cleanly along the stateless/stateful boundary:
- Cloud Run hosts three services: the React frontend, the Django API
(scale-to-zero), and the Celery worker (pinned
min_instances=1with internal-only ingress — workers must always be listening to the queue). - A single GCE VM runs MySQL, Redis, and Qdrant via docker-compose on an attached SSD persistent disk. Firewall rules allow ports 3306/6379/6333 only from the serverless VPC connector range — the data plane has no public surface.
- Cloud Scheduler fires the daily pipeline at 06:00 IST via an OIDC-authenticated HTTP call; Cloud Run IAM ensures only the scheduler's service account can invoke it.
- Secret Manager holds everything sensitive: Django secret, DB password, Gemini key, and the 5paisa credential set including the TOTP seed URI.
- Terraform is organized as reusable modules (network, datavm, cloudrun,
scheduler, storage, secrets) with separate
dev/prodtfvars; Cloud Build runsplanon PR andapplyon merge.
4. The Agents, In Detail
4.1 Excel Parser (deterministic "agent")
Not an LLM — and that's the point. Screener exports have a fixed template (P&L,
Quarters, Balance Sheet, Cash Flow all formula-derived from a master Data Sheet), so a
schema-aware state machine beats any generic approach. It walks Data Sheet sections
(META → PROFIT & LOSS → Quarters → BALANCE SHEET → CASH FLOW → PRICE → DERIVED),
resets period headers at each Report Date row, and emits ~130 typed metric
records.
pd.to_datetime(errors="coerce"). Pandas
interpreted every numeric as nanoseconds since epoch — so Sales of ₹14,600 Cr
became 1970-01-01. Every single financial figure was silently destroyed.
Lesson: never datetime-coerce columns that contain numbers.4.2 Price & Performance Fetcher
Authenticates to 5paisa with TOTP (session cached in Redis, 6h TTL), pulls ~400 days of candles for the stock and every benchmark index, then computes YTD/1W/1M/3M/6M/9M returns and — the metric that matters — alpha (stock return minus index return) per benchmark, with sector auto-mapping (Auto → NIFTY AUTO, IT → NIFTY IT).
4.3 Investor Presentation Agent
Sends the raw PDF to Gemini (inline bytes ≤19MB; Files API above) with
response_mime_type: application/json and a fixed extraction schema: about,
revenue mix by product/channel/region, regional financials, highlights, growth plans,
capex, government incentives, new products. Gemini reads the charts and tables inside
the deck images — no OCR pipeline needed. Apollo Tyres' Q4 deck extracted in 19 seconds
with revenue splits accurate to the percentage point.
4.4 Concall Transcript Agent
Same PDF mechanism, different schema: it separates management commentary from analyst Q&A, classifies tone (BULLISH → BEARISH), and extracts demand outlook, margin commentary, pricing actions, explicit guidance, risks, key Q&A pairs, and up to three verbatim quotes. The output of 16 quarters of these became the corpus for cross-quarter RAG questions.
4.5 News Agent (Search-Grounded)
Gemini 2.5 Flash with the google_search tool finds the last 7 days of
company news and returns sentiment-tagged JSON. Two production gotchas: grounded
responses wrap JSON in prose (solved with a tolerant first-[-to-last-]
extractor), and grounding redirect URLs exceed 500 characters (solved by switching
URLField → TextField).
4.6 Screener Document Sync
Scrapes the Documents section of the company's Screener page, classifies links by
label (transcript/presentation/annual report), validates downloads by %PDF
magic bytes, dedupes against a URL ledger, and auto-queues analysis. First run it
discovered and processed Apollo's entire concall archive unattended — which also taught
me to add per-run caps and 429 retry-with-backoff.
4.7 Master Recommendation Agent
The synthesis layer (Gemini 2.5 Pro). Inputs: fundamentals summary from MySQL, price and alpha data, pre-computed technicals, latest IP + concall insights, news, the user's actual position, and the agent's own evaluated track record. Output: a strict JSON contract — action, confidence, a mandatory bear case, short/long-term targets with stop-loss and timelines, four-pillar rationale, key risks, and portfolio-specific sizing advice.
4.8 Evaluator & Backtester
Every recommendation is scored daily against real candles: target hit? stop-loss hit? maximum favorable/adverse excursion? The backtester goes further — monthly point-in-time replays (candles truncated to the as-of date; only documents published before that date) scored against actual forward 1M/3M/6M returns, bucketed by action and confidence level.
5. Multi-Agent Architecture
This is a hub-and-spoke design, not a free-form agent swarm. Specialist agents run independently and persist structured outputs; the master agent consumes those artifacts — never raw inputs.
┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐
│ Excel │ │ Price/Perf │ │ IP Agent │ │ Concall │
│ Parser │ │ + Technicals│ │ (PDF→JSON) │ │ Agent │
└─────┬──────┘ └─────┬──────┘ └─────┬──────┘ └─────┬──────┘
│ │ │ │
MySQL + MySQL MySQL + Qdrant MySQL + Qdrant
Qdrant │ │ │
│ │ │ │
┌─────┴───────────────┴───────────────┴───────────────┴─────┐
│ STRUCTURED ARTIFACT LAYER │
│ (typed rows + JSON insights + embedded chunks) │
└─────┬───────────────────────────────────────────────┬─────┘
│ │
┌─────────▼─────────┐ ┌──────────────┐ ┌─────────▼────────┐
│ MASTER RECO │◄──│ News Agent │ │ RAG Chat │
│ AGENT │ │ (grounded) │ │ (analyst Q&A) │
│ + Confidence │ └──────────────┘ └──────────────────┘
│ Governor (code)│
└─────────┬─────────┘
│
┌─────────▼─────────┐ ┌──────────────────┐
│ Recommendation │────────►│ EVALUATOR │──┐
│ (persisted) │ │ scores vs actual │ │ feedback loop:
└───────────────────┘ │ prices daily │ │ track record
▲ └──────────────────┘ │ injected into
└─────────────────────────────────────────┘ next prompt
Why this shape instead of agent-to-agent messaging?
- Debuggability — every intermediate artifact is inspectable in MySQL/Qdrant. When the recommendation looks wrong, you can see exactly which input misled it.
- Independent failure domains — a news-agent outage doesn't block the recommendation; it degrades gracefully to "News: None".
- Cost control — expensive extractions (PDF analysis) run once and are reused by both the master agent and the chat, instead of re-reading PDFs per query.
- The feedback loop is the differentiator — the evaluator's scorecard flows back into the master agent's context, making it the only agent in the system that reads its own report card.
6. Why Prompts Are the Product
The same model, same data, same architecture produced opposite recommendations depending purely on prompt design. Three prompt engineering decisions mattered more than any infrastructure choice:
① Fixed JSON schemas turn LLMs into reliable parsers
Every extraction agent receives an exact output contract ("Return STRICT JSON with
EXACTLY this structure, use null/[] if not found") combined with
response_mime_type: application/json. Across 60+ document analyses, schema
adherence was effectively 100% — the unreliable part of LLMs isn't structure, it's
unconstrained generation.
② Source skepticism must be explicit
The original master prompt treated concall insights as ground truth. Result: the agent absorbed management's optimism wholesale. The fix was adversarial framing baked into the prompt: "Management commentary is INHERENTLY PROMOTIONAL... Treat guidance as marketing until corroborated by reported numbers" — plus a mandatory bear_case field: if the model can't argue against its own call, it hasn't analyzed the stock.
③ Self-evaluation in-context changes behavior
The most surprising finding. Injecting the agent's own track record — "Your high-confidence calls averaged −5.3% vs −0.4% for low-confidence ones. You issued BUY 8/8 times including at the cycle top." — measurably recalibrated it. On identical live data, the recommendation shifted from ADD @ 75% confidence to HOLD @ 50% with an articulate value-trap bear case.
7. RAG Deep-Dive: Chunks, Dimensions, Retrieval
The Flow
INGESTION RETRIEVAL
───────── ─────────
┌─────────────┐ ┌──────────────┐ ┌─────────────────┐
│ Excel │ │ IP / Concall │ │ User question │
│ (Screener) │ │ PDFs │ │ "How has margin │
└──────┬──────┘ └──────┬───────┘ │ guidance │
│ │ │ evolved?" │
schema-aware Gemini structured └────────┬────────┘
parser extraction (JSON) │
│ │ embed (gemini-
▼ ▼ embedding-001)
┌─────────────────────────────┐ │
│ SEMANTIC CHUNK BUILDER │ ▼
│ · 1 chunk per metric │ ┌──────────────────────────┐
│ · 1 chunk per insight │ │ Qdrant query_points │
│ section │ │ payload filter: │
│ · headers baked into │ │ stock_id = N │
│ every chunk │ │ + cosine similarity │
│ · auto YoY/QoQ stats │ │ across BOTH collections │
└──────────┬──────────────────┘ └────────────┬─────────────┘
│ │
embed (gemini-embedding-001, top-k chunks (5 + 5)
task_type=retrieval_document) │
│ ▼
▼ ┌──────────────────────────┐
┌─────────────────────────────┐ │ Gemini 2.5 Flash │
│ QDRANT │ │ context-locked prompt: │
│ · excel_financials │ │ "Answer ONLY from │
│ · documents │ │ context. Quote exact │
│ vectors + rich payload │ │ numbers & periods. │
│ (stock_id, statement_type, │ │ State units." │
│ fiscal_years, doc_type, │ └────────────┬─────────────┘
│ period, metric, raw_text) │ ▼
└─────────────────────────────┘ Grounded answer
Chunking Strategy: Semantic Units, Not Token Windows
Standard RAG advice says "512-token chunks with 50-token overlap." For financial data this is exactly wrong: a token window slices a table row away from its column headers, producing orphaned numbers no embedding can rescue. StockApp chunks by semantic unit instead:
- Financial chunks — one per line item, carrying the company name, statement type, unit declaration, every period:value pair, plus auto-computed latest YoY/QoQ change and min/max across periods.
- Document chunks — one per insight section (demand outlook, margin commentary, capex…), prefixed with company / document type / period.
Company: Apollo Tyres Ltd (NSE/BSE listed)
Financial Statement: Income Statement
Metric: Sales
Unit: INR Crores
Values:
Mar-2017: 13180.04
Mar-2018: 14842.85
...
Mar-2025: 26123.42
Latest YoY change (Mar-2024 → Mar-2025): +2.9%
Highest: 26123.42 | Lowest: 13180.04 across 9 periods
Embedding Model & Dimensions
| Aspect | Implementation |
|---|---|
| Model | gemini-embedding-001 — selected at runtime from a candidate list, because text-embedding-004 was deprecated mid-project |
| Dimensions | Discovered dynamically from the first embedding batch; the Qdrant collection is created (or rebuilt) to match — model swaps don't break the store |
| Distance | Cosine |
| Task types | retrieval_document at ingest, retrieval_query at question time (asymmetric optimization) |
| Batching | 100 texts per API call |
Retrieval Mechanism: Hybrid, Filtered, Multi-Collection
Pure vector search fails financial queries in predictable ways — "Q3 sales" matches annual sales chunks; one stock's metrics bleed into another's. The fix is filter-then-rank:
- Payload pre-filter — Qdrant indexed fields (
stock_id,statement_type,fiscal_years,doc_type) constrain the candidate set before similarity ranking. Multi-stock isolation is a hard filter, not a hope. - Cosine top-k across both collections (financials + documents), 5 hits each, merged into ≤10 context chunks.
- Context-locked generation — the answer prompt forbids outside knowledge, requires exact figures with period labels, and forces unit disambiguation (Excel = ₹ Crores; presentations often ₹ Mn).
The payoff: asked "How has margin guidance evolved over the last few quarters?", the system pulled chunks spanning Q1 FY25 through Q4 FY26 concalls and synthesized the guidance arc — long-term >15% EBITDA target, raw-material headwinds, plant-closure tailwinds — with every claim traceable to a retrieved chunk.
8. Daily Automation: Eliminating the Manual Grind
One Cloud Scheduler trigger at 06:00 IST runs the full pipeline per stock: prices → performance → news → document sync → recommendation evaluation → fresh recommendation. Here's what that replaces:
| Task | Manual Workflow | StockApp Automated | Time Saved/Day |
|---|---|---|---|
| Price & benchmark check | Open broker app, check stock, open NSE site, compare 4 indices, compute alpha by hand | 5paisa API + computed alpha across all benchmarks | ~15 min |
| Financials review | Re-download Screener Excel, eyeball trends | Parsed once; trends & YoY pre-computed; chatbot for any lookup | ~20 min |
| New document detection | Check Screener/BSE manually for new filings | Scraper detects, downloads, validates, dedupes | ~10 min |
| Presentation analysis | Read 30–40 slides, take notes | Structured extraction in ~20s: revenue mix, capex, guidance | ~45 min |
| Concall analysis | Read/listen 60–90 min, separate signal from PR | Tone, guidance, risks, key Q&A extracted; PR-skepticism built into downstream prompt | ~75 min |
| News scan | Google, filter noise, judge sentiment | Search-grounded agent, sentiment-tagged, deduped | ~15 min |
| Synthesis & decision | Hold it all in your head; vulnerable to recency & confirmation bias | Master agent with mandatory bear case, calibrated confidence, portfolio-aware sizing | ~30 min |
| Track-record review | Almost nobody does this honestly | Automatic daily scoring vs actual prices; fed back into the agent | priceless |
| Total (per stock) | ~3.5 hours | ~4 minutes of compute | ~98% |
And the automated version scales linearly with watchlist size, never skips a step on a busy day, and never "forgets" to log whether yesterday's call worked out.
9. The Backtest That Humbled My Agent
Before trusting the recommendations, I built a harness that replays the agent at historical month-ends with strictly point-in-time data, then scores each call against real forward returns. The verdict on 8 monthly replays:
Agent avg forward 3M: −3.43%
Buy-and-hold avg 3M: −3.43% ← identical: ZERO selection edge
Action distribution: BUY 8/8 ← permabull
High-confidence (≥70) avg: −5.28%, 40% target-hit
Low-confidence (<70) avg: −0.36%, 67% target-hit ← confidence INVERTED
Highest-confidence call: 80% conf, at the exact cycle top, −14.7% fwd 3M
10. Engineering Challenges
| Challenge | Root Cause | Resolution |
|---|---|---|
All financials became 1970-01-01 | pandas datetime coercion of numerics | Type-strict cleaning; only true datetime cells converted |
| Two Gemini models deprecated mid-build | Provider model churn | Runtime model auto-detection with candidate fallback lists |
| Benchmark data was the wrong index | Assumed scrip codes; NIFTY AUTO ↔ NIFTY PSE swap | Scrip-master verification + empty-frame guards |
| News items silently dropped | Grounding redirect URLs > 200-char URLField | TextField + truncation; per-item exception isolation |
| One bad import killed the whole API | Module-level imports of optional services | Function-level imports for non-critical paths |
| 60+ document flood on first sync | Unbounded archive scrape | Per-run cap, coverage cap, 429 retry with backoff |
| Agent permabull bias | Promotional source material taken at face value | Prompt discipline + scorecard feedback + code governor |
11. Pros & Cons
Pros
- ~98% reduction in daily research time, scaling linearly across a watchlist
- Deterministic math + LLM interpretation = no hallucinated numbers
- 16 quarters of document memory queryable in natural language
- Built-in honesty: every call scored against reality, biases measured not assumed
- Self-correcting loop — the agent reads its own report card
- Hybrid retrieval prevents cross-stock contamination by construction
- Modular agents fail independently; pipeline degrades gracefully
Cons & Limitations
- n=8 backtest on one stock is evidence of process, not statistical edge
- Mild lookahead leakage: document insights generated by today's extraction prompts
- LLM API costs scale with document volume; model deprecations require maintenance
- Screener scraping is brittle vs bot protection; exchange APIs are the robust path
- Single-VM data plane is a cost compromise (snapshot-mitigated)
- Recommendations are decision support, not financial advice — the human stays in the loop
12. References
- Qdrant — Filtering & Hybrid Search: qdrant.tech/documentation/concepts/filtering
- Gemini API — Document (PDF) Understanding: ai.google.dev/gemini-api/docs/document-processing
- Gemini API — Grounding with Google Search: ai.google.dev/gemini-api/docs/grounding
- Gemini API — Embeddings & task types: ai.google.dev/gemini-api/docs/embeddings
- 5paisa OpenAPI documentation: 5paisa.com/developerapi
- Celery — Tasks, retries & backoff: docs.celeryq.dev
- Django REST Framework: django-rest-framework.org
- Cloud Run + Cloud Scheduler (OIDC invocation): cloud.google.com/run/docs/triggering/using-scheduler
- Terraform Google Provider: registry.terraform.io
- Screener.in — financial data source: screener.in