IMPLEMENTATION.md

SuperLawyer — Implementation Blueprint

A production-grade AI legal assistant grounded in ~20,000 private legal documents, delivering cited, hallucination-resistant answers for Brazilian law practitioners.


1. Problem Statement

Legal professionals need an AI assistant that can research case law, draft petitions, review documents, and check regulatory compliance, with every answer grounded in verifiable sources and exact citations rather than hallucinated law.

The system must handle Portuguese-language legal PDFs with complex structure (articles, clauses, tables, footnotes, multi-column layouts) from sources like Planalto.gov.br, Diário Oficial, and private firm repositories.


2. Architecture Overview

┌──────────────────────────────────────────────────────────────┐
│                        FRONTEND (Next.js)                    │
│  Chat UI · Action buttons · Document upload · Source viewer  │
│  Deep-read panel (PageIndex tree visualization)              │
└──────────────────────┬───────────────────────────────────────┘
                       │ REST / WebSocket
┌──────────────────────▼───────────────────────────────────────┐
│                    API LAYER (FastAPI)                        │
│         Auth · Rate limiting · Session management            │
└──────────────────────┬───────────────────────────────────────┘
                       │
┌──────────────────────▼───────────────────────────────────────┐
│               AGENT ROUTER (LangGraph)                       │
│  Intent detection → route to the right tool/workflow         │
│                                                              │
│  ┌─────────────┐  ┌──────────────┐  ┌────────────────────┐  │
│  │  Research    │  │  Draft       │  │  Document Review   │  │
│  │  Agent       │  │  Agent       │  │  Agent             │  │
│  └──────┬──────┘  └──────┬───────┘  └────────┬───────────┘  │
│         │                │                    │              │
│  ┌──────▼────────────────▼────────────────────▼───────────┐  │
│  │          RETRIEVAL ENGINE (LlamaIndex)                 │  │
│  │  Hybrid Search (Dense + Sparse/BM25) → Reranker        │  │
│  │  Parent-doc retriever · Sentence-window retriever       │  │
│  │  Metadata filtering (court, date, doc type, status)     │  │
│  └──────────┬────────────────────────────────────────────┘  │
│             │                                                │
│  ┌──────────▼──────────┐  ┌──────────────────────────────┐  │
│  │  VECTOR DB (Qdrant)  │  │  DEEP READER (PageIndex)    │  │
│  │  Dense (BGE-M3)      │  │  Tree index per document     │  │
│  │  + Sparse (BM25)     │  │  LLM reasoning over tree     │  │
│  │  Metadata payloads   │  │  Exact page/section refs     │  │
│  │                      │  │  Agentic chat w/ citations   │  │
│  │  ── FAST CORPUS ──   │  │  ── DEEP SINGLE-DOC ──      │  │
│  │  "find the docs"     │  │  "read the doc like a        │  │
│  │                      │  │   senior associate"           │  │
│  └──────────────────────┘  └──────────────────────────────┘  │
│                                                              │
│  ┌───────────────────────────────────────────────────────┐  │
│  │              LLM (Claude Sonnet/Opus via API)         │  │
│  │  System prompt enforces citation-only answers          │  │
│  │  Fallback: Gemini 2.5 Pro (large context window)      │  │
│  └───────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────┐
│              OFFLINE INGESTION PIPELINE                       │
│  LlamaParse → Hierarchical Chunking → Metadata Enrichment    │
│  → BGE-M3 Embedding → Qdrant Upsert                         │
│  PageIndex tree pre-generation (high-value docs) [cached]    │
│  Scheduled via Prefect / Airflow                             │
└──────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────┐
│              OBSERVABILITY & EVALUATION                       │
│  Arize Phoenix / LangSmith · RAGAS evaluation suite          │
│  Faithfulness · Answer relevance · Citation accuracy         │
└──────────────────────────────────────────────────────────────┘

Two-tier retrieval model: Qdrant handles fast corpus-wide search across 20k docs (milliseconds). PageIndex handles deep, structure-aware reasoning within a single document once it's been identified (seconds). The agent router decides when to escalate from corpus search to deep read — this is the "Modular RAG" pattern that outperforms both pure vector and pure long-context approaches.


3. Technology Stack

| Layer | Choice | Rationale |
| --- | --- | --- |
| Orchestration | LlamaIndex (core) | Best-in-class for heavy document ingestion, hierarchical indexing, and advanced RAG. Purpose-built for knowledge-base work. |
| Agentic Workflows | LangGraph | Stateful, graph-based agent orchestration. Handles multi-step workflows (research → draft → review) that LlamaIndex alone doesn't model well. |
| PDF Parsing | LlamaParse | GenAI-native parser that preserves tables, footnotes, article numbering, and multi-column layouts — critical for Brazilian legal PDFs. |
| Embeddings | BGE-M3 (BAAI/bge-m3) | Local, multilingual (excellent Portuguese support), produces both dense and sparse vectors in a single model. No per-call API cost. Alternative: Voyage AI voyage-law-2 if budget allows (trained on legal corpora). |
| Vector Database | Qdrant (self-hosted) | Native hybrid search (dense + BM25), rich metadata filtering, Kubernetes-ready, zero vendor lock-in. |
| Reranker | Cohere Rerank 3 or BGE-reranker | Post-retrieval precision boost. Cross-encoder scoring ensures the most relevant chunks surface for the LLM. |
| LLM | Claude Sonnet/Opus (Anthropic) | Strongest legal reasoning, citation fidelity, and safety in 2026. Fallback: Gemini 2.5 Pro (massive context window for full-document analysis). |
| API Backend | FastAPI | Async-native, OpenAPI docs, WebSocket support for streaming responses. |
| Frontend | Next.js + Tailwind | Modern UI with chat interface, action buttons, document upload, inline source citations. |
| Deep Reader | PageIndex (open-source, MIT license) | Vectorless, reasoning-based retrieval for single-document deep analysis. Runs locally — builds hierarchical tree indexes and uses our existing LLM (Claude) to navigate them. No additional API cost beyond the LLM calls we already pay for. |
| Monitoring | Arize Phoenix + RAGAS | Trace every retrieval and generation step. Evaluate faithfulness, answer relevance, and citation accuracy on golden question sets. |
| Scheduling | Prefect | Orchestrate batch ingestion runs, re-embedding on doc updates. |
| Deployment | Docker Compose (start) → Kubernetes (SaaS scale) | Everything containerized. Single-server Docker Compose for one firm. K8s only when serving multiple firms as a SaaS. |

4. Detailed Design

4.1 Ingestion Pipeline

The ingestion pipeline runs offline (once initially, then incrementally on new/updated documents).

Step 1 — Parse PDFs with LlamaParse

LlamaParse uses vision models to extract text while preserving structural semantics. This is non-negotiable for legal PDFs where article numbering, clause hierarchy, and table integrity matter.

from llama_parse import LlamaParse

parser = LlamaParse(
    result_type="markdown",
    parsing_instruction=(
        "Extract articles, clauses, súmulas, acórdãos exactly. "
        "Preserve all numbering, headers, and table structure. "
        "Do not summarize or paraphrase."
    ),
)

Step 2 — Hierarchical / Semantic Chunking

Never chunk by fixed character count. Legal documents must be chunked by structural elements (Article → Section → Clause). Use LlamaIndex's HierarchicalNodeParser to create a parent-child node tree so the retriever can return a narrow clause for precision or escalate to the full article for context.

from llama_index.core.node_parser import HierarchicalNodeParser

node_parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 1024, 512],
)
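In production the parent-escalation behavior comes from LlamaIndex's parent-document retrieval over this node tree; the stdlib sketch below only illustrates the merge heuristic, with hypothetical names and data shapes: if enough of a parent's child chunks match a query, return the parent's full text (the whole article) instead of the fragments.

```python
# Illustrative sketch of parent-document merging (names and threshold are
# assumptions, not LlamaIndex API). If most of a parent's children match,
# escalate to the parent's full text for context.

def merge_to_parents(matched_leaf_ids, leaves, parents, threshold=0.6):
    """Replace groups of matched leaf chunks with their parent chunk.

    leaves:  {leaf_id: {"parent_id": ..., "text": ...}}
    parents: {parent_id: {"text": ..., "child_ids": [...]}}
    """
    matched = set(matched_leaf_ids)
    results, consumed = [], set()
    for pid, parent in parents.items():
        hits = [c for c in parent["child_ids"] if c in matched]
        # Enough siblings matched: return the full parent (e.g. the article)
        if parent["child_ids"] and len(hits) / len(parent["child_ids"]) >= threshold:
            results.append(("parent", pid, parent["text"]))
            consumed.update(hits)
    for lid in matched_leaf_ids:
        if lid not in consumed:
            results.append(("leaf", lid, leaves[lid]["text"]))
    return results
```

With both clauses of Art. 5 matched, the retriever returns the full article; an isolated clause match stays narrow.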

Step 3 — Metadata Enrichment

Every chunk must carry rich metadata for deterministic pre-filtering at query time:

| Field | Example | Purpose |
| --- | --- | --- |
| source | planalto.gov.br | Origin tracking |
| document_type | statute, case_law, contract, doctrine | Route-aware filtering |
| jurisdiction | federal, SP, RJ | Jurisdiction scoping |
| court | STF, STJ, TST, TRF-3 | Court-specific queries |
| date | 2025-06-15 | Temporal filtering |
| status | active, revoked, overturned | Exclude dead law |
| document_title | Lei nº 13.709/2018 (LGPD) | Citation rendering |

Metadata is attached during ingestion via custom transformations or extracted by LlamaParse itself from document headers.
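A custom transformation of this kind boils down to pattern extraction over each chunk's header text. The sketch below is a minimal stdlib version; the regex patterns are simplified examples, not an exhaustive extractor, and in the pipeline this logic would run inside a LlamaIndex transformation step.

```python
import re

# Illustrative metadata extractor for the fields in the table above.
# Patterns are deliberately simple examples; real Brazilian legal headers
# need a richer pattern set.

COURT_RE = re.compile(r"\b(STF|STJ|TST|TRF-\d)\b")
LAW_RE = re.compile(r"Lei\s+nº\s*[\d.]+/\d{4}")
DATE_RE = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")

def extract_metadata(text: str) -> dict:
    """Pull court, statute title, and date from a chunk's header text."""
    meta = {}
    if (m := COURT_RE.search(text)):
        meta["court"] = m.group(1)
    if (m := LAW_RE.search(text)):
        meta["document_title"] = m.group(0)
    if (m := DATE_RE.search(text)):
        meta["date"] = m.group(1)
    return meta
```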

Step 4 — Embed & Store

from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-m3")

client = QdrantClient(host="localhost", port=6333)
vector_store = QdrantVectorStore(
    client=client,
    collection_name="brazil_legal",
    enable_hybrid=True,
)

Ingestion for 20k docs should be batched with asyncio — expect a few hours for the initial run. Subsequent runs use Qdrant upsert on changed documents only.
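The batching pattern can be sketched with stdlib asyncio: split the corpus into batches and cap in-flight concurrency so the embedder and Qdrant are not overwhelmed. Here `embed_and_upsert` is a stand-in for the real BGE-M3 + Qdrant calls, and the batch/concurrency numbers are illustrative starting points.

```python
import asyncio

# Batched ingestion sketch: bounded concurrent batches over the corpus.
# embed_and_upsert is any async callable taking a list of documents.

async def ingest_all(docs, embed_and_upsert, batch_size=32, concurrency=4):
    sem = asyncio.Semaphore(concurrency)  # cap in-flight batches

    async def run_batch(batch):
        async with sem:
            await embed_and_upsert(batch)

    batches = [docs[i:i + batch_size] for i in range(0, len(docs), batch_size)]
    await asyncio.gather(*(run_batch(b) for b in batches))
    return len(batches)
```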


4.2 Retrieval Engine

The retrieval engine is the heart of the system. It chains three stages to guarantee precision:

Stage 1 — Hybrid Search (Dense + Sparse)

Vector (dense) search captures semantic similarity ("What are the penalties for breach of confidentiality?"). Keyword (sparse/BM25) search captures exact matches ("Lei nº 13.709/2018", "Súmula 331 do TST"). Combining both eliminates the blind spots of either approach alone — this is the single biggest accuracy lever for legal RAG.

retriever = index.as_retriever(
    similarity_top_k=10,
    sparse_top_k=10,
    vector_store_query_mode="hybrid",
)

Stage 2 — Metadata Pre-Filtering

Before search results are scored, Qdrant filters by metadata to scope results to the right jurisdiction, court, document type, and time range. This is deterministic (no AI involved) and prevents the model from citing revoked statutes or irrelevant jurisdictions.

from llama_index.core.vector_stores import MetadataFilters, MetadataFilter

filters = MetadataFilters(filters=[
    MetadataFilter(key="court", value="STF"),
    MetadataFilter(key="status", value="active"),
])

retriever = index.as_retriever(
    similarity_top_k=10,
    sparse_top_k=10,
    vector_store_query_mode="hybrid",
    filters=filters,
)

Stage 3 — Reranking

Hybrid search returns ~20 candidates. A cross-encoder reranker scores each against the original query and surfaces the top 5. This step is critical — it turns a good retrieval into a precise one.

from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.postprocessor.cohere_rerank import CohereRerank

reranker = CohereRerank(api_key="...", top_n=5)

# Build the query engine from the hybrid retriever defined above;
# RetrieverQueryEngine accepts an explicit retriever, unlike as_query_engine.
query_engine = RetrieverQueryEngine.from_args(
    retriever=retriever,
    node_postprocessors=[reranker],
)

Advanced Retrieval Techniques (Phase 2+)

| Technique | What it does | When to add |
| --- | --- | --- |
| PageIndex deep read | Escalates to tree-based reasoning for deep single-document analysis (see Section 4.7). | Phase 2 |
| HyDE (Hypothetical Document Embeddings) | LLM generates a hypothetical answer, embeds it, uses that embedding for retrieval. Helps with vague queries. | After baseline is working |
| Multi-query retrieval | Rewrites the user query into 3-5 variants, retrieves for each, merges results. Captures different phrasings. | After baseline is working |
| Parent-document retriever | Returns the narrow matching chunk but also fetches its parent node (full article/section) for context. | From the start (built into hierarchical chunking) |
| Sentence-window retriever | Returns a narrow match but expands the context window around it. Good for clause-level precision. | After baseline is working |
| GraphRAG | Builds a knowledge graph of citation relationships between cases. Enables "find all cases that cite X". | Phase 3 |
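The merge step of multi-query retrieval can be sketched in a few lines: retrieve for each rewritten variant, deduplicate by chunk id keeping the best score, and rank. Variant generation would be an LLM call; here `retrieve` stands in for any function returning (chunk_id, score) pairs.

```python
# Multi-query retrieval sketch: merge results across query variants,
# keeping each chunk's best score. `retrieve` is a stand-in for the
# hybrid retriever call.

def multi_query_retrieve(variants, retrieve, top_k=5):
    best = {}
    for query in variants:
        for chunk_id, score in retrieve(query):
            if score > best.get(chunk_id, float("-inf")):
                best[chunk_id] = score
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]
```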

4.3 Agent Router (LangGraph)

The UI exposes distinct actions (Research, Draft Petition, Review Document, Check Compliance). Instead of routing everything through one prompt, a lightweight LLM-based router classifies intent and dispatches to specialized agents.

User Input
    │
    ▼
┌─────────────────┐
│  Intent Router   │  (lightweight LLM call or classifier)
└────┬──┬──┬──┬───┘
     │  │  │  │
     ▼  ▼  ▼  ▼
 Research  Draft  Review  Compliance
  Agent   Agent   Agent    Agent

Each agent is a LangGraph StateGraph with its own node sequence. Agents have access to two retrieval tools and decide which to use:

Research Agent: Query rewrite → Hybrid retrieval (Qdrant) → Rerank → If user needs deeper analysis of a specific source, escalate the top doc to PageIndex deep read → LLM synthesis with citations.

Draft Agent: Research via Qdrant → Identify key source documents → PageIndex deep read on the most relevant 1-3 docs for precise clause extraction → Outline generation → Section-by-section drafting → Self-review for legal accuracy → Output with embedded citations.

Review Agent: Uploaded document goes directly to PageIndex (tree build + agentic chat) for structure-aware deep analysis. PageIndex navigates the document like a senior associate — finding key clauses, risks, and obligations. Compliance rules are retrieved from Qdrant in parallel for cross-referencing.

Compliance Agent: Retrieves the applicable regulatory framework from Qdrant, then runs PageIndex deep read on the user's contract/document to extract every relevant clause. Outputs a compliance checklist with pass/fail per requirement, citing exact page and section from both the regulation and the document.

from langgraph.graph import StateGraph, END

def detect_intent(state):
    # Router node: classify the request and record the intent in state.
    return {"intent": classify_intent(state["user_input"])}

def route_intent(state):
    return state["intent"]  # "research" | "draft" | "review" | "compliance"

graph = StateGraph(AgentState)
graph.add_node("router", detect_intent)
graph.add_node("research", research_agent)
graph.add_node("draft", draft_agent)
graph.add_node("review", review_agent)
graph.add_node("compliance", compliance_agent)

graph.set_entry_point("router")
graph.add_conditional_edges("router", route_intent, {
    "research": "research",
    "draft": "draft",
    "review": "review",
    "compliance": "compliance",
})
for agent in ("research", "draft", "review", "compliance"):
    graph.add_edge(agent, END)

app = graph.compile()

4.4 LLM Integration & Prompt Strategy

Primary model: Claude Sonnet/Opus via Anthropic API. Fallback: Gemini 2.5 Pro (useful when full-document context is needed — 1M+ token window).

The system prompt enforces grounded, cited responses:

You are a senior Brazilian legal assistant. Your role is to provide accurate,
well-cited legal analysis.

RULES:
1. Answer ONLY using the provided context documents. If the context does not
   contain sufficient information, say so explicitly.
2. CITE every claim: include the document title, article/section number,
   and page reference. Format: [Source: <document>, Art. <number>].
3. NEVER fabricate statutes, case numbers, or legal principles.
4. When citing case law, include the full case identifier
   (e.g., "STF, RE 123456/SP, Rel. Min. X, j. 01/01/2025").
5. Use formal legal Portuguese appropriate for professional legal documents.
6. If multiple sources conflict, present both positions with citations.

4.5 Citation System

Citations are first-class in this architecture:

  1. LlamaIndex CitationQueryEngine wraps each source chunk with a numbered reference and injects them into the prompt.
  2. The LLM response contains inline [Source N] markers.
  3. The frontend renders these as clickable links that expand to show the original chunk text, document title, and a link to the full document.

from llama_index.core.query_engine import CitationQueryEngine

query_engine = CitationQueryEngine.from_args(
    index,
    similarity_top_k=10,
    citation_chunk_size=512,
)

response = query_engine.query("...")
# response.source_nodes contains the cited chunks with metadata
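The frontend payload for step 3 is a straightforward transformation of those source nodes. The sketch below works on plain dicts for illustration; in the real code each item comes from `response.source_nodes` (text plus metadata), and the field names are our own, not a LlamaIndex schema.

```python
# Sketch: turn cited source nodes into the payload the frontend renders
# as clickable [Source N] links with expandable excerpts.

def build_citation_payload(source_nodes):
    citations = []
    for i, node in enumerate(source_nodes, start=1):
        meta = node["metadata"]
        citations.append({
            "marker": f"[Source {i}]",
            "title": meta.get("document_title", "Unknown document"),
            "court": meta.get("court"),
            "excerpt": node["text"][:300],  # preview shown on expand
        })
    return citations
```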

4.6 Observability & Evaluation

Tracing (Arize Phoenix)

Every query flows through: user input → retrieval → reranking → LLM prompt → response. Each step is traced with latencies, token counts, and intermediate outputs. This is essential for debugging bad answers ("why did it cite the wrong statute?").

Evaluation (RAGAS)

Before going to production, build a golden evaluation set of ~100 legal questions with known correct answers and expected source citations. Run RAGAS metrics:

| Metric | What it measures |
| --- | --- |
| Faithfulness | Does the answer only contain claims supported by the retrieved context? |
| Answer Relevance | Does the answer actually address the question? |
| Context Precision | Are the retrieved chunks relevant to the question? |
| Context Recall | Did retrieval find all the chunks needed to answer correctly? |
| Citation Accuracy (Custom) | Do the cited source references match real documents in the knowledge base? |

Target: Faithfulness > 0.95, Context Precision > 0.85 before any production deployment.
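The custom citation-accuracy metric is simple enough to sketch directly: parse the [Source: ...] markers out of an answer and check each cited title against the titles actually present in the knowledge base. This is a minimal stdlib sketch; the marker format follows the system prompt above.

```python
import re

# Custom citation-accuracy metric sketch: fraction of cited titles that
# exist in the knowledge base. Complements the standard RAGAS metrics.

CITATION_RE = re.compile(r"\[Source:\s*([^,\]]+)")

def citation_accuracy(answer: str, known_titles: set) -> float:
    cited = [c.strip() for c in CITATION_RE.findall(answer)]
    if not cited:
        return 0.0  # an uncited answer scores zero by policy
    valid = sum(1 for c in cited if c in known_titles)
    return valid / len(cited)
```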


4.7 PageIndex Deep Reader

PageIndex is the second retrieval tier — a fully open-source (MIT license, github.com/VectifyAI/PageIndex) vectorless, reasoning-based system that builds a hierarchical tree index from a document and uses LLM agents to navigate it. Where Qdrant finds the right document from 20k, PageIndex reads that document like a human expert. It runs locally. The only cost is the LLM calls it makes — and we're already paying for Claude.

Why this matters for law: Traditional vector RAG chunks a 200-page acórdão into fragments and retrieves by similarity — losing structural context (which section does this clause belong to? what came before it?). PageIndex preserves the entire document structure as a navigable tree and returns answers with exact page ranges and section paths. This is precisely how a senior associate reads: scan the table of contents, drill into the relevant section, read the surrounding context.

How PageIndex works

  1. Tree Generation (offline, once per document): PDF → LLM-powered analysis → hierarchical JSON tree where each node = section title + summary + exact page range (start/end index) + child nodes. Runs locally via run_pageindex.py. Output is a JSON file you cache.
  2. Tree Search (at query time): Query → LLM reasoning agent walks the cached tree step-by-step: "Is this about 'Disposições Gerais' or the 'Acórdão'? Follow this branch..." → returns a traceable path with exact page citations.
  3. Both steps use whatever LLM you configure (--model flag). We point it at Claude instead of the default gpt-4o.
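Once a tree-search step has picked a node_id, turning it into a human-readable citation path is pure tree traversal. The sketch below assumes the node schema shown in the tree-generation example later in this section (title, node_id, start_index, end_index, nodes) and renders paths like "Acórdão > Voto do Relator, pp. 12-28".

```python
# Resolve a node_id in a cached PageIndex tree to its section path and
# page range, for rendering traceable citations. Schema per the JSON
# example in this section; the rendering format is our own choice.

def find_node_path(tree: dict, node_id: str, path=()):
    path = path + (tree["title"],)
    if tree["node_id"] == node_id:
        return {
            "path": " > ".join(path),
            "pages": (tree["start_index"], tree["end_index"]),
        }
    for child in tree.get("nodes", []):
        found = find_node_path(child, node_id, path)
        if found:
            return found
    return None
```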

What's open-source vs. paid API

| Capability | Paid API provides | We cover it with | Gap? |
| --- | --- | --- | --- |
| Tree generation | Yes | PageIndex open-source repo (run_pageindex.py) | None |
| Tree search / retrieval | Yes | PageIndex open-source repo (cookbooks) | None |
| Multi-turn chat over a doc | Built-in Chat API | LangGraph agents (Review/Draft) maintain conversation state, ask sequential questions against the tree + Claude | None — same result, we own the logic |
| OCR (scanned PDFs) | Yes (paid feature) | LlamaParse (free tier or paid, already in our stack) | None — LlamaParse is better for batch ingestion anyway |
| Tree caching | Server-side, automatic | tree_cache.py — file_hash → JSON on disk | None — simpler, we own the cache |
| Streaming responses | Built-in | FastAPI WebSocket + Claude streaming API | None — already in our architecture |
| MCP integration | Yes | Not needed — we're building a product, not a plugin for other AI tools | N/A |
| Cost | $50/mo + $0.02/query | $0 (beyond LLM calls we already pay for) | We save money |
| Privacy | Data sent to VectifyAI servers | Full — nothing leaves our infra | We're more private |

Every capability the paid API offers is either in the open-source repo already or handled by other tools in our stack (LangGraph, LlamaParse, FastAPI). We don't need the paid API.

Integration architecture

User asks about a specific document (uploaded or retrieved)
    │
    ▼
┌───────────────────────────────────────────────────┐
│  PageIndex Document Manager (our wrapper)          │
│                                                    │
│  1. Check tree cache (file_hash → cached JSON?)    │
│     ├─ HIT  → load cached tree JSON                │
│     └─ MISS → run PageIndex tree generation        │
│              (local, uses Claude API)               │
│              → save tree JSON to cache              │
│                                                    │
│  2. Tree search (query time)                       │
│     → LLM walks the tree with our question         │
│     → Returns relevant nodes + page ranges         │
│     → Feed those pages as context to Claude         │
│     → Claude answers with exact citations           │
└───────────────────────────────────────────────────┘

Setup

# Clone PageIndex into the project (or add as git submodule)
git clone https://github.com/VectifyAI/PageIndex.git
pip install -r PageIndex/requirements.txt

PageIndex uses an OpenAI-compatible API by default. We configure it to use Claude (via a proxy or by modifying the LLM call — the code is open, we can swap the client).

Tree generation (offline, cached)

# Generate tree for a single PDF (outputs JSON)
python3 PageIndex/run_pageindex.py \
    --pdf_path data/legal_docs/stf_acordao_12345.pdf \
    --model gpt-4o \
    --if-add-node-summary yes

Output is a JSON tree like:

{
  "title": "Acórdão - RE 123456/SP",
  "node_id": "0001",
  "start_index": 1,
  "end_index": 5,
  "summary": "Recurso extraordinário sobre repercussão geral...",
  "nodes": [
    {
      "title": "Relatório",
      "node_id": "0002",
      "start_index": 5,
      "end_index": 12,
      "summary": "O Ministro Relator apresentou...",
      "nodes": []
    },
    {
      "title": "Voto do Relator",
      "node_id": "0003",
      "start_index": 12,
      "end_index": 28,
      "summary": "O Relator sustentou que...",
      "nodes": [...]
    }
  ]
}

Tree caching

Tree generation takes 30-120s per document (LLM calls to build the tree). The tree JSON is cached by file hash — build once, reuse forever.

import hashlib
import json
import subprocess
from pathlib import Path

TREE_CACHE_DIR = Path("data/pageindex_cache")
# run_pageindex.py writes its output JSON under the PageIndex repo's results
# directory; the exact filename convention may vary by version, so verify
# against your checkout.
PAGEINDEX_RESULTS_DIR = Path("PageIndex/results")

def get_or_build_tree(pdf_path: str) -> dict:
    """Return cached tree or generate a new one via PageIndex."""
    file_hash = hashlib.sha256(Path(pdf_path).read_bytes()).hexdigest()
    cache_file = TREE_CACHE_DIR / f"{file_hash}.json"

    if cache_file.exists():
        return json.loads(cache_file.read_text())

    subprocess.run([
        "python3", "PageIndex/run_pageindex.py",
        "--pdf_path", pdf_path,
        "--model", "gpt-4o",
        "--if-add-node-summary", "yes",
    ], check=True)

    # Locate the tree JSON the script produced for this PDF (assumed naming).
    output_path = PAGEINDEX_RESULTS_DIR / f"{Path(pdf_path).stem}_structure.json"
    tree = json.loads(output_path.read_text())

    TREE_CACHE_DIR.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(tree, ensure_ascii=False))
    return tree

| Caching strategy | When | How |
| --- | --- | --- |
| Pre-build (batch) | High-value docs (top 500 most-queried statutes, landmark STF decisions) | Nightly job generates trees, stores JSON in cache dir |
| Build-on-first-access | Any doc the user clicks "Deep Read" on | Generate tree on demand, cache for future queries |
| Skip | Simple lookups that Qdrant handles fine | Agent decides not to escalate (most queries stay in Tier 1) |
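The nightly pre-build selection (scheduled via Prefect in this design) reduces to: rank documents by how often they are queried, take the top N, and skip any whose tree is already cached. A minimal sketch, with illustrative data shapes:

```python
# Nightly pre-build selection sketch. query_counts and cached_hashes would
# come from query logs and the cache directory respectively; names here
# are assumptions for illustration.

def select_for_prebuild(query_counts, doc_hashes, cached_hashes, top_n=500):
    """query_counts: {doc_id: times queried}; doc_hashes: {doc_id: file hash}."""
    ranked = sorted(query_counts, key=query_counts.get, reverse=True)
    return [
        doc_id for doc_id in ranked[:top_n]
        if doc_hashes[doc_id] not in cached_hashes  # skip already-built trees
    ]
```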

Deep read (query time)

Once we have the tree JSON, the deep-read step feeds it to Claude along with the question. Claude navigates the tree, identifies the relevant section(s), and we extract the actual page content for those page ranges to build the final answer.

import json

import anthropic

client = anthropic.Anthropic()

def deep_read(tree: dict, pdf_pages: dict, question: str) -> str:
    """
    Use the PageIndex tree to answer a question about a document.

    Args:
        tree: The PageIndex tree JSON for this document
        pdf_pages: Dict mapping page_index → page text content
        question: The user's question
    """
    tree_json = json.dumps(tree, indent=2, ensure_ascii=False)

    navigation_response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                f"You are navigating a legal document's structure to find "
                f"the sections relevant to this question: {question}\n\n"
                f"Document tree structure:\n{tree_json}\n\n"
                f"Return ONLY the node_ids and page ranges (start_index, "
                f"end_index) of the most relevant sections. Format as JSON."
            ),
        }],
    )
    relevant_nodes = parse_relevant_nodes(navigation_response)

    context_pages = []
    for node in relevant_nodes:
        for page_idx in range(node["start_index"], node["end_index"] + 1):
            if page_idx in pdf_pages:
                context_pages.append(
                    f"[Page {page_idx}]\n{pdf_pages[page_idx]}"
                )

    answer_response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        system=(
            "You are a senior Brazilian legal assistant. Answer using ONLY "
            "the provided page content. CITE exact page numbers and section "
            "titles for every claim."
        ),
        messages=[{
            "role": "user",
            "content": (
                f"Question: {question}\n\n"
                f"Relevant document sections:\n{''.join(context_pages)}"
            ),
        }],
    )

    return answer_response.content[0].text

The two-tier retrieval flow (end-to-end)

This is the core pattern: Qdrant finds, PageIndex reads.

async def research_with_deep_read(question: str, filters: dict = None):
    """
    Tier 1: Qdrant corpus search → top documents
    Tier 2: PageIndex deep read on the best match
    """
    # --- Tier 1: Fast corpus search (Qdrant) ---
    retriever = index.as_retriever(
        similarity_top_k=10,
        sparse_top_k=10,
        vector_store_query_mode="hybrid",
        filters=build_metadata_filters(filters),
    )
    nodes = await retriever.aretrieve(question)
    reranked = reranker.postprocess_nodes(nodes, query_str=question)

    top_doc = reranked[0]
    doc_path = top_doc.metadata["file_path"]
    doc_title = top_doc.metadata["document_title"]

    # --- Tier 2: Deep read (PageIndex, runs locally) ---
    tree = get_or_build_tree(doc_path)
    pdf_pages = extract_pages(doc_path)
    deep_answer = deep_read(tree, pdf_pages, question)

    return {
        "answer": deep_answer,
        "source_document": doc_title,
        "qdrant_context": [n.text for n in reranked[:3]],
        "retrieval_path": "qdrant → rerank → pageindex deep read",
    }

When the agent escalates to PageIndex

The LangGraph agent router uses these heuristics to decide Tier 1 vs Tier 2:

| Signal | Action |
| --- | --- |
| User clicks "Review this document" / uploads a file | Always Tier 2 — go directly to PageIndex |
| User asks a cross-corpus question ("all STF decisions on X") | Tier 1 only — Qdrant with metadata filters |
| User asks about a specific statute/case ("Article 5 of the Constitution") | Tier 1 → Tier 2 — Qdrant finds the doc, PageIndex reads it deeply |
| Agent detects Qdrant retrieval is low-confidence (reranker scores < threshold) | Escalate to Tier 2 — deep read on top candidate for better extraction |
| Draft agent needs precise clause text for a petition | Tier 1 → Tier 2 — find sources, then deep-read for exact quotable text |
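The heuristics above can be collapsed into a small decision function that the LangGraph router calls after reranking. The 0.5 confidence threshold and the signal flags are illustrative starting points, not tuned values:

```python
# Tier 1 / Tier 2 escalation sketch, mirroring the heuristics table.
# Signals would be set by the intent classifier and reranker upstream.

def choose_tier(uploaded_doc: bool, cross_corpus: bool,
                specific_doc: bool, top_rerank_score: float,
                threshold: float = 0.5) -> str:
    if uploaded_doc:
        return "tier2"             # review uploads go straight to PageIndex
    if cross_corpus:
        return "tier1"             # corpus-wide questions stay in Qdrant
    if specific_doc or top_rerank_score < threshold:
        return "tier1_then_tier2"  # find the doc, then deep-read it
    return "tier1"
```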

5. Phased Implementation Plan

Phase 1 — Foundation (Weeks 1-2)

Goal: Working RAG pipeline with 100 test documents, queryable via CLI.

| Task | Details |
| --- | --- |
| Project scaffolding | Monorepo setup, dependency management, Docker Compose for Qdrant |
| LlamaParse integration | Parse 100 sample Brazilian legal PDFs, validate structural integrity |
| Chunking pipeline | Implement hierarchical chunking with metadata extraction |
| Qdrant setup | Deploy locally, create collection with hybrid index |
| BGE-M3 embedding | Embed and store chunks |
| Basic query engine | Hybrid retrieval + Cohere rerank + Claude synthesis |
| Citation engine | Integrate CitationQueryEngine, verify source traceability |
| CLI testing | Run 20 test legal queries, manually evaluate accuracy |

Exit criteria: Can ask a legal question and get a cited, accurate answer from the 100-doc corpus.


Phase 2 — Scale, Agents & Deep Reader (Weeks 3-6)

Goal: Full 20k document corpus ingested. Agent router operational. PageIndex deep reader integrated. API backend serving requests.

| Task | Details |
| --- | --- |
| Batch ingestion | Async pipeline to ingest all 20k documents into Qdrant (expect 4-8 hours) |
| Metadata enrichment | Automated extraction of court, jurisdiction, date, status |
| PageIndex integration | Integrate open-source PageIndex repo, implement tree cache manager, pre-build trees for top 100-500 high-value docs |
| Two-tier retrieval | Wire Qdrant (Tier 1) → PageIndex (Tier 2) escalation pipeline with agent-controlled switching |
| FastAPI backend | REST + WebSocket endpoints for chat, document upload, streaming, deep-read |
| Agent router | LangGraph-based intent classification and routing with Tier 1/2 tool selection |
| Research agent | Full pipeline: hybrid search → rerank → optional PageIndex escalation → cited synthesis |
| Draft agent | Multi-step: research → PageIndex deep-read for precise clause extraction → outline → draft → self-review |
| Review agent | Uploaded doc → PageIndex tree build + agentic chat → clause extraction → risk flagging. Compliance rules from Qdrant in parallel. |
| HyDE + multi-query | Advanced retrieval techniques for ambiguous queries |
| RAGAS evaluation | Build 100-question golden set, run baseline evaluation. Compare Tier 1-only vs Tier 1+2 accuracy. |
Exit criteria: All 20k docs indexed. PageIndex trees cached for top docs. Research, Draft, and Review agents functional via API with two-tier retrieval. RAGAS faithfulness > 0.90.


Phase 3 — Frontend & Production Hardening (Weeks 7-9)

Goal: Full UI live. Observability in place. Production-ready deployment.

| Task | Details |
| --- | --- |
| Next.js frontend | Chat UI, action buttons, document upload, inline citations with source viewer |
| Deep-read panel | PageIndex tree visualization — show the reasoning path the agent took through the document structure. Clickable tree nodes expand to page content. |
| Streaming responses | WebSocket-based token streaming for real-time UX (including PageIndex stream_metadata for showing "Searching section X..." progress) |
| Auth & multi-tenancy | User accounts, firm-level document isolation |
| Observability | Arize Phoenix tracing on all queries |
| Error handling | Graceful fallbacks, rate limiting, input validation |
| RAGAS tuning | Iterate on chunking, retrieval, and prompts until faithfulness > 0.95 |
| Deployment | Docker Compose for single-server production, CI/CD pipeline, staging environment |
| Security audit | Ensure no document leakage across tenants, API key rotation |

Exit criteria: End-to-end working product. Traced, evaluated, deployed.


Phase 4 — Advanced Features (Weeks 10-14)

Goal: Differentiated features that move beyond a basic RAG chatbot.

| Feature | Details |
| --- | --- |
| GraphRAG | Knowledge graph of citation relationships. "Find all cases that cite Súmula 331." |
| Compliance agent | Automated regulatory compliance checking against uploaded contracts |
| Document comparison | Side-by-side diff of two legal documents with AI-highlighted differences |
| Scheduled monitoring | Alert when new legislation or case law affects a client's matter |
| Fine-tuned reranker | Train a domain-specific reranker on Brazilian legal query-document pairs |
| Conversation memory | Multi-turn legal research sessions with context persistence |
| Export | Generate formatted PDF petitions, opinions, and memos |

6. Cost Estimates

Most components in this stack can run locally or have generous free tiers. The only hard cost is the LLM API (Claude). Everything else is a choice between "free but you run it yourself" and "paid API for convenience."

What costs money and why

| Component | What it is | Paid API cost | Free / local alternative |
|---|---|---|---|
| Claude API | The LLM that generates answers. 3rd-party API (Anthropic). No way around this — you need a frontier model for legal reasoning. | ~$3 per 1M input tokens, ~$15 per 1M output tokens (Sonnet). Roughly $0.01-0.05 per query depending on context size. | None at this quality. Gemini 2.5 Pro has a free tier with rate limits. Open-source LLMs (Llama, Mistral) are an option but weaker on legal citation fidelity. |
| LlamaParse | Cloud PDF parser by the LlamaIndex team. You upload PDFs, it returns structured markdown. 3rd-party API. | $0.003/page (paid tier). Free tier: 1000 pages/day — enough for Phase 1. | For batch ingestion, use open-source Unstructured.io locally ($0) or PyMuPDF + custom parsing. |
| Cohere Rerank | Cloud reranking API by Cohere. Scores retrieved chunks for relevance. 3rd-party API. | ~$1 per 1000 queries. | Run BGE-reranker locally ($0). Same concept, no API call. Slightly worse quality but free and private. |
| PageIndex | Open-source tree-index generator. Runs locally. MIT license. | N/A — we run it ourselves. | Free. Cloned from GitHub. The only cost is the LLM calls it makes during tree generation — which go through our existing Claude/OpenAI key. |
| Qdrant | Vector database. We self-host it (Docker container). | N/A — we run it ourselves. | Free. Runs as a Docker container on your machine. No cloud service needed. |
| BGE-M3 | Embedding model. Runs locally on your machine. | N/A — we run it ourselves. | Free. Downloads once (~2GB), runs on CPU or GPU. |
| Infrastructure | The server(s) everything runs on. | Depends on where you deploy. | Your laptop ($0) for dev. A single VPS for production. |
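
To sanity-check the per-query figure above, the arithmetic is simple. The token counts below are assumptions; the pricing is the Sonnet rate from the table:

```python
# Sonnet pricing from the table: $3 / 1M input tokens, $15 / 1M output tokens
IN_PER_M, OUT_PER_M = 3.00, 15.00

def query_cost(input_tokens, output_tokens):
    """USD cost of one LLM call at the table's per-million-token rates."""
    return input_tokens / 1e6 * IN_PER_M + output_tokens / 1e6 * OUT_PER_M

# A context-heavy RAG query: ~8k tokens of retrieved context + prompt, ~600-token answer
print(round(query_cost(8_000, 600), 4))   # → 0.033
# A lean query: ~2k tokens in, ~300 out
print(round(query_cost(2_000, 300), 4))   # → 0.0105
```

Both land inside the $0.01-0.05 per-query range in the table; context size (how many chunks you stuff into the prompt) dominates the cost.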

Scenario 1: Getting started — first user, 100 docs (Phase 1)

Everything runs on your laptop or a single cheap VPS. No Kubernetes, no managed services.

| Component | What you use | Monthly cost |
|---|---|---|
| LLM APIs (Claude + OpenAI) | Claude for answers (~200 queries/month). OpenAI for PageIndex tree generation (gpt-4o, ~10-20 trees while testing). | ~$5-15 |
| LlamaParse | Free tier (1000 pages/day), plenty for 100 docs | $0 |
| Cohere Rerank | Skip — use local BGE-reranker instead | $0 |
| PageIndex | Open-source, runs locally. Tree gen uses OpenAI calls counted above. | $0 |
| Qdrant | Docker container on your machine | $0 |
| BGE-M3 | Runs locally | $0 |
| Infrastructure | Your laptop, or a $20/mo VPS if you want it online | $0-20 |
| Total | | ~$5-35/month |

One-time setup cost: $0. You sign up for free tiers, clone repos, `docker run`, and start building.

Note: PageIndex currently defaults to gpt-4o for tree generation. The code is open-source so we can modify it to use Claude instead, eliminating the OpenAI dependency entirely. This is a Phase 2 task.

Scenario 2: At scale — small law firm, 20k docs, 10 active lawyers

"Scale" here means: 10 lawyers using the system daily, ~1,000 queries/month each = 10,000 queries/month total. About 20-30% of those escalate to PageIndex deep reads. The 20k document corpus is fully ingested.

| Component | What you use | Monthly cost |
|---|---|---|
| LLM APIs (Claude + OpenAI) | Claude: ~10k queries/month + deep-read synthesis. OpenAI: PageIndex tree generation for new docs (or swap to Claude once modified). | $200-500 |
| LlamaParse | One-time bulk ingestion already done. Incremental for new docs (~500 pages/month). | ~$2 |
| Cohere Rerank | 10k queries/month at ~$1/1k queries. Or stay with local BGE-reranker for $0. | $0-10 |
| PageIndex | Open-source, runs locally. Tree gen + deep-read LLM costs already included in the Claude line. | $0 |
| Qdrant | Self-hosted on same server | $0 |
| BGE-M3 | Runs on same server | $0 |
| Infrastructure | Single dedicated server (8-16 CPU, 32-64 GB RAM, GPU optional). Hetzner/OVH: ~$50-100. AWS/GCP: ~$100-200. | $50-200 |
| Total | | ~$250-700/month |

One-time costs at this stage:

- LlamaParse bulk ingestion of 20k docs: ~$60-100 (or $0 if using Unstructured.io locally)
- PageIndex tree pre-generation for the top 500 docs: ~$20-50 in LLM calls (one-time, trees cached as JSON forever)

Scenario 3: SaaS for multiple law firms (future)

This is when you'd consider Kubernetes, managed databases, multi-tenancy, and bigger infrastructure. Not relevant until you're serving 5+ firms and charging for it. At that point you're generating revenue and infrastructure costs ($1,000-2,500/month) are a line item, not a concern.

How to minimize costs during development

  1. Use free tiers aggressively: LlamaParse (1000 pages/day), Gemini 2.5 Pro free tier for LLM experiments.
  2. Run everything local: BGE-M3, BGE-reranker, Qdrant (Docker), PageIndex, Unstructured.io — all free, all open-source, all private.
  3. Only pay for Claude: The one thing you can't replace cheaply. At dev volumes (~200 queries/month), it's $5-10.
  4. Cache PageIndex trees: A tree, once built, is a JSON file reused for every subsequent query against that document. Build once, query forever. The only cost is the LLM calls during tree generation.
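
Point 4 is cheap to implement. Below is a minimal sketch of the tree cache, keyed by content hash so a renamed or moved file still hits the cache; `build_fn` stands in for the LLM-backed PageIndex tree generator, and its `build_fn(pdf_path) -> dict` signature is an assumption, not the real wrapper:

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("data/pageindex_cache")  # matches the repo layout in section 8

def file_hash(pdf_path):
    """Content hash of the PDF, so renamed/moved files still hit the cache."""
    return hashlib.sha256(Path(pdf_path).read_bytes()).hexdigest()

def load_or_build_tree(pdf_path, build_fn):
    """Return the cached tree for this PDF, building it exactly once if missing."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    cache_file = CACHE_DIR / f"{file_hash(pdf_path)}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    tree = build_fn(pdf_path)  # the expensive LLM-heavy step (~30-120s per doc)
    cache_file.write_text(json.dumps(tree, ensure_ascii=False))
    return tree
```

Every query after the first pays only file-hashing and a JSON read; the LLM cost of tree generation is amortized to zero.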

7. Key Design Decisions & Rationale

Why BGE-M3 over Voyage AI / OpenAI embeddings? BGE-M3 runs locally (no per-call cost at 20k docs), produces both dense and sparse vectors in one model, and has excellent Portuguese/multilingual support. Voyage voyage-law-2 is trained on legal corpora and may yield better retrieval for English legal text, but BGE-M3 wins for Brazilian Portuguese. We can A/B test both during Phase 2.

Why Qdrant over Pinecone? Self-hosted = full data sovereignty (critical for law firms). Native hybrid search. No vendor lock-in. Pinecone serverless is a valid alternative if the team prefers managed infrastructure.

Why LangGraph for agents instead of LlamaIndex agents? LlamaIndex excels at retrieval but its agent abstraction is less mature for complex multi-step workflows (research → draft → review cycles). LangGraph's stateful graph model maps naturally to legal workflows with conditional branching, human-in-the-loop review, and retry logic.

Why Claude over GPT/Gemini as primary LLM? Claude leads on instruction following, citation fidelity, and safety (refusing to hallucinate when context is insufficient). Gemini 2.5 Pro is kept as fallback for scenarios requiring massive context windows (entire case files in a single prompt).

Why not fine-tune the LLM? RAG with a strong base model outperforms fine-tuning for knowledge-grounded tasks where the source documents change frequently. Fine-tuning is brittle (needs retraining on new legislation), expensive, and doesn't provide per-claim citations. RAG gives us all three: freshness, cost efficiency, and traceability.

Why PageIndex as Tier 2 instead of replacing Qdrant entirely? PageIndex is brilliant for deep single-document reasoning (98.7% on FinanceBench), but it isn't designed for 20k-doc corpus search. Tree generation is LLM-heavy (~30-120s per doc) and multi-document cross-search isn't its strength. Qdrant gives us sub-50ms corpus-wide search with metadata filtering — something PageIndex cannot do. The two-tier model gives us the best of both: instant corpus routing (Qdrant) + expert-level document reading (PageIndex).

Why not use PageIndex for every query? Latency. A Qdrant hybrid search takes milliseconds. A PageIndex deep read makes 2+ LLM calls and takes seconds. For "show me all STF súmulas from 2025" the user wants instant results from the corpus — PageIndex adds no value there. But for "analyze the risk clauses in this 200-page contract" it is transformatively better than chunk-based retrieval. The agent router makes this decision automatically.
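
That routing decision can start as a plain heuristic before any learned classifier. A sketch with illustrative keyword cues and thresholds (none of these values are tuned; the real router would also weigh conversation context):

```python
# Hypothetical Tier 1 -> Tier 2 escalation heuristic. Cues, cutoffs, and
# the signature are illustrative assumptions, not the shipped router.
def should_deep_read(query: str, top_score: float, doc_pages: int) -> bool:
    """Escalate to a PageIndex deep read when the query asks for
    document-level analysis of a long doc, or corpus retrieval looks weak."""
    analysis_cues = ("analyze", "analise", "compare", "risk", "risco",
                     "clause", "cláusula", "summarize", "resuma")
    wants_analysis = any(cue in query.lower() for cue in analysis_cues)
    weak_retrieval = top_score < 0.45   # best rerank score below cutoff
    long_document = doc_pages > 50      # chunking loses structure here
    return (wants_analysis and long_document) or weak_retrieval

print(should_deep_read("analyze the risk clauses in this contract", 0.80, 200))  # → True
print(should_deep_read("show me all STF súmulas from 2025", 0.90, 3))            # → False
```

The first query escalates (analysis intent on a 200-page doc); the second stays on the millisecond Qdrant path.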

Why self-host PageIndex instead of using their paid API? PageIndex is fully open-source (MIT license). The open-source repo handles tree generation and retrieval — which is all we need. We already have LangGraph for multi-turn agent workflows, LlamaParse for OCR, and FastAPI for streaming. The paid API ($50/mo) bundles convenience features we'd be duplicating. Self-hosting keeps our costs at $0 (beyond existing LLM calls), keeps full data privacy, and avoids vendor dependency.


8. Repository Structure

superlawyer/
├── IMPLEMENTATION.md          # This document
├── docker-compose.yml         # Qdrant + API + workers
├── PageIndex/                 # Git clone of VectifyAI/PageIndex (open-source, MIT)
├── backend/
│   ├── api/                   # FastAPI application
│   │   ├── main.py
│   │   ├── routes/
│   │   │   ├── chat.py        # Chat endpoint (REST + WebSocket)
│   │   │   ├── documents.py   # Document upload & management
│   │   │   └── health.py
│   │   └── middleware/        # Auth, rate limiting, CORS
│   ├── agents/                # LangGraph agent definitions
│   │   ├── router.py          # Intent classification + routing
│   │   ├── research.py        # Research agent
│   │   ├── draft.py           # Drafting agent
│   │   ├── review.py          # Document review agent
│   │   └── compliance.py      # Compliance checking agent
│   ├── retrieval/             # LlamaIndex retrieval engine (Tier 1)
│   │   ├── engine.py          # Query engine setup (hybrid + rerank)
│   │   ├── index.py           # Index management
│   │   └── prompts.py         # System prompts and prompt templates
│   ├── deep_reader/           # PageIndex deep reader (Tier 2, open-source)
│   │   ├── tree_builder.py    # Wrapper around PageIndex tree generation
│   │   ├── tree_cache.py      # File-hash → cached JSON tree persistence
│   │   ├── tree_search.py     # LLM-based tree navigation + page extraction
│   │   ├── deep_read.py       # End-to-end deep-read pipeline (tree + Claude)
│   │   └── escalation.py      # Tier 1 → Tier 2 escalation logic + heuristics
│   ├── ingestion/             # Offline ingestion pipeline
│   │   ├── pipeline.py        # Main ingestion orchestration
│   │   ├── parser.py          # LlamaParse configuration
│   │   ├── chunker.py         # Hierarchical chunking
│   │   ├── metadata.py        # Metadata extraction & enrichment
│   │   ├── embedder.py        # BGE-M3 embedding
│   │   └── tree_prebuild.py   # Batch PageIndex tree generation for high-value docs
│   ├── eval/                  # Evaluation suite
│   │   ├── golden_set.json    # 100 question-answer pairs
│   │   └── run_ragas.py       # RAGAS evaluation runner
│   └── config.py              # Centralized configuration
├── frontend/                  # Next.js application
│   ├── app/
│   ├── components/
│   │   ├── ChatInterface.tsx
│   │   ├── ActionButtons.tsx
│   │   ├── CitationViewer.tsx
│   │   ├── DocumentUpload.tsx
│   │   ├── SourcePanel.tsx
│   │   └── TreePathViewer.tsx # PageIndex tree navigation + reasoning path display
│   └── ...
├── data/                      # Document storage (gitignored)
│   ├── legal_docs/
│   └── pageindex_cache/       # Cached tree index mappings (file_hash → doc_id)
├── scripts/
│   ├── ingest.py              # CLI entry point for ingestion
│   └── evaluate.py            # CLI entry point for evaluation
└── requirements.txt

9. Getting Started (Phase 1 Checklist)

# 1. Clone and setup
cd superlawyer
python -m venv .venv && source .venv/bin/activate

# 2. Install dependencies
pip install llama-index llama-index-core llama-parse \
    llama-index-embeddings-huggingface \
    llama-index-vector-stores-qdrant \
    llama-index-postprocessor-cohere-rerank \
    qdrant-client langgraph fastapi uvicorn anthropic

# 3. Clone PageIndex (open-source, MIT license)
git clone https://github.com/VectifyAI/PageIndex.git
pip install -r PageIndex/requirements.txt

# 4. Start Qdrant
docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant

# 5. Set API keys
export LLAMA_CLOUD_API_KEY="..."    # LlamaParse (free tier: 1000 pages/day)
export ANTHROPIC_API_KEY="..."      # Claude (the only real cost)
export OPENAI_API_KEY="..."         # For PageIndex tree generation (uses gpt-4o by default)

# 6. Place 100 test PDFs in data/legal_docs/

# 7. Run ingestion (Tier 1: Qdrant)
python scripts/ingest.py

# 8. Test Tier 1 (corpus search)
python -c "from backend.retrieval.engine import query; print(query('...'))"

# 9. Test Tier 2 (PageIndex deep read) on a single doc
python3 PageIndex/run_pageindex.py --pdf_path data/legal_docs/sample_acordao.pdf
# → outputs tree JSON, then use deep_read() to query it

This document is the single source of truth for the SuperLawyer implementation. Update it as architectural decisions evolve.