I built a text-to-SQL reporting agent that takes a natural language question and returns a validated SQL result and summary, reasoning over a 15+ table enterprise database. The pipeline is implemented in LangGraph with a Chainlit front-end and runs against Postgres.
Architecture
The agent is structured as a graph of specialist nodes rather than a single monolithic prompt:
- Schema resolver narrows the search to the tables and columns relevant to a given question, using an embedded vector store for few-shot example retrieval and semantic-layer mappings. This RAG step keeps the context the generator sees small and focused.
- Query generator drafts SQL grounded in the resolved schema and retrieved examples.
- Guardrails enforce structural constraints (read-only statements, scoped tables, row/limit bounds) before a query touches the database.
- Validator runs `EXPLAIN` and targeted sanity checks, and routes obvious failures back to the generator with a correction hint instead of surfacing them to the user.
- Cache memoizes resolved schemas, retrieved examples, and full query results, keyed on the semantic intent of the question rather than the raw string.
- Summarizer turns the result set into a short natural-language answer with the SQL and source rows attached for auditability.
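Wired together, the control flow is a small state machine: resolve the schema once, then loop generation through guardrails and validation, with a bounded retry edge carrying a correction hint back to the generator. The sketch below is plain Python rather than actual LangGraph code, and every node function, key name, and the `MAX_RETRIES` bound is a hypothetical stand-in:

```python
MAX_RETRIES = 2  # assumed bound on validator -> generator correction loops

def run_pipeline(question: str, nodes: dict) -> dict:
    """Drive the node functions in `nodes` over a shared state dict."""
    state = {"question": question, "hint": None}
    state["schema"] = nodes["resolve_schema"](state)
    for _attempt in range(MAX_RETRIES + 1):
        state["sql"] = nodes["generate"](state)
        violations = nodes["guardrails"](state)
        if violations:
            ok, hint = False, violations[0]
        else:
            ok, hint = nodes["validate"](state)
        if ok:
            state["result"] = nodes["execute"](state)
            state["answer"] = nodes["summarize"](state)
            return state
        state["hint"] = hint  # routed back to the generator as a correction hint
    state["answer"] = "Could not produce a valid query."
    return state
```

The point of the explicit loop is that a failed `EXPLAIN` never reaches the user: it becomes a hint on the next generation attempt, up to the retry cap.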
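The guardrail layer amounts to a pre-flight check that runs before any SQL reaches Postgres. Here is a deliberately minimal sketch: the real system would use a proper SQL parser where this uses naive prefix and keyword matching, and `ALLOWED_TABLES` and `MAX_LIMIT` are illustrative names, not the actual configuration:

```python
import re

ALLOWED_TABLES = {"orders", "customers", "invoices"}  # hypothetical scope
MAX_LIMIT = 10_000  # hypothetical row bound

def check_query(sql: str) -> list[str]:
    """Return a list of guardrail violations; an empty list means the query may run."""
    violations = []
    stripped = sql.strip().rstrip(";")
    # Read-only: only a single SELECT statement is allowed.
    if not stripped.lower().startswith("select"):
        violations.append("only SELECT statements are allowed")
    if ";" in stripped:
        violations.append("multiple statements are not allowed")
    # Table scoping: every FROM/JOIN target must be on the allow-list.
    for table in re.findall(r"\b(?:from|join)\s+([a-zA-Z_][\w.]*)", stripped, re.I):
        if table.lower() not in ALLOWED_TABLES:
            violations.append(f"table {table!r} is out of scope")
    # Row bound: require an explicit LIMIT within the cap.
    m = re.search(r"\blimit\s+(\d+)", stripped, re.I)
    if m is None:
        violations.append("query must include a LIMIT clause")
    elif int(m.group(1)) > MAX_LIMIT:
        violations.append(f"LIMIT exceeds cap of {MAX_LIMIT}")
    return violations
```

Returning all violations at once, rather than failing on the first, gives the generator a richer correction hint on retry.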
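"Keyed on semantic intent" means a cache hit is any stored question whose embedding is close enough to the incoming one, so rephrasings reuse earlier work. In this sketch a bag-of-words vector stands in for the real sentence-embedding model, and the threshold is an illustrative default:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Placeholder: word counts stand in for a real embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Cache whose lookup key is question meaning, not the raw string."""

    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self.entries: list[tuple[Counter, object]] = []

    def get(self, question: str):
        q = embed(question)
        for key, value in self.entries:
            if cosine(q, key) >= self.threshold:
                return value
        return None

    def put(self, question: str, value) -> None:
        self.entries.append((embed(question), value))
```

The same lookup can front the schema-resolution and example-retrieval steps, not just full query results.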
Evaluation
I built an evaluation suite alongside the agent so changes to prompts, retrieval, or the graph could be measured rather than vibe-checked. It covers schema-resolution accuracy, SQL execution correctness, numeric agreement with ground-truth answers, and summary faithfulness. Running the suite on every change made it much easier to tell whether a new prompt actually helped or just moved failures around.
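The execution-correctness check boils down to comparing result sets. One way to sketch it (function names here are illustrative, not the suite's actual API): compare rows order-insensitively, with a numeric tolerance so float aggregates don't fail on rounding noise, then aggregate into a pass rate:

```python
def rows_match(expected, actual) -> bool:
    """Order-insensitive result-set comparison with float rounding tolerance."""
    if len(expected) != len(actual):
        return False
    def norm(row):
        return tuple(round(v, 6) if isinstance(v, float) else v for v in row)
    return sorted(map(norm, expected)) == sorted(map(norm, actual))

def score(cases, run_agent) -> float:
    """Fraction of eval cases whose agent output matches the ground truth."""
    passed = sum(rows_match(c["expected"], run_agent(c["question"])) for c in cases)
    return passed / len(cases)
```

A single scalar per run is what makes prompt and retrieval changes comparable across commits.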
What I'd do differently
The biggest lesson was how much RAG quality dominates final accuracy. Early iterations over-invested in the generator prompt and under-invested in retrieval; swapping in better semantic-layer mappings and tighter few-shot examples moved the needle more than any generator change.