Shipping a RAG Assistant Over 12M Records in Six Weeks
- RAG
- LLM
- LangChain
- FastAPI
Retrieval-augmented generation (RAG) is often the fastest path from "we have a lot of data" to "anyone can ask questions about it in plain English." The pattern is simple to describe and easy to get wrong: retrieve the most relevant context for a question, hand it to an LLM with tight instructions, and return a grounded answer instead of a hallucination.
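The retrieve-then-constrain loop fits in a few lines. What follows is a minimal sketch, not our production code: the corpus, the `retrieve` and `build_prompt` helpers, and the token-overlap scoring are all illustrative assumptions, with the overlap score standing in for whatever vector or keyword search a real system would use.

```python
import re
from collections import Counter

# Hypothetical toy corpus; a real index would hold millions of records.
DOCS = {
    "acme": "Acme Corp security rating B. Open ports detected on two hosts.",
    "globex": "Globex security rating A. No critical findings in the last scan.",
    "initech": "Initech security rating C. Expired TLS certificates on the mail server.",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(question, docs, k=2):
    """Rank documents by token overlap with the question.

    This is a stand-in for real retrieval (assumption), chosen so the
    example runs with the standard library only.
    """
    question_counts = Counter(tokenize(question))
    scored = []
    for doc_id, text in docs.items():
        overlap = sum((question_counts & Counter(tokenize(text))).values())
        scored.append((overlap, doc_id))
    # Highest overlap first; break ties by ID so results are deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for overlap, doc_id in scored[:k] if overlap > 0]


def build_prompt(question, doc_ids, docs):
    """Wrap the retrieved context in instructions that forbid outside knowledge."""
    context = "\n".join(f"[{doc_id}] {docs[doc_id]}" for doc_id in doc_ids)
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say you do not know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


question = "What is the security rating of Initech?"
top_ids = retrieve(question, DOCS, k=1)
prompt = build_prompt(question, top_ids, DOCS)
```

The resulting `prompt` string is what would be sent to the model. Tagging each context line with its document ID makes it possible to ask the model to cite the IDs it relied on.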
When we built a conversational interface over security ratings for more than 12 million companies, the hard parts were not the model calls. They were chunking and indexing the data so retrieval surfaced the right records, constraining the prompt so the model answered only from retrieved context, and serving it all behind a FastAPI layer fast enough to feel conversational. We deployed to Kubernetes on AWS and reached production in six weeks.
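To make the chunking step concrete, here is one way it could look for structured records. This is a sketch under assumptions: the record fields (`id`, `name`, `rating`, `findings`), the `Chunk` type, and the `max_chars` budget are hypothetical, not the schema we shipped. The idea it illustrates is repeating the company header on every chunk so each one is self-describing when retrieved on its own.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source_id: str  # ID of the record this chunk came from


def chunk_record(record, max_chars=80):
    """Pack a record's findings into chunks of at most max_chars characters.

    Every chunk starts with the same header so it still names the company
    after being separated from its siblings. A single finding longer than
    the budget is kept whole rather than split mid-sentence.
    """
    header = f"{record['name']} (rating {record['rating']}): "
    chunks = []
    current = ""
    for finding in record["findings"]:
        candidate = f"{current} {finding}".strip()
        if current and len(header) + len(candidate) > max_chars:
            # Adding this finding would exceed the budget: close the chunk.
            chunks.append(Chunk(header + current, record["id"]))
            current = finding
        else:
            current = candidate
    if current:
        chunks.append(Chunk(header + current, record["id"]))
    return chunks


# Hypothetical record, for illustration only.
record = {
    "id": "c-001",
    "name": "Acme Corp",
    "rating": "B",
    "findings": [
        "Open ports detected on two hosts.",
        "Expired TLS certificate on mail server.",
        "Outdated CMS on the marketing site.",
    ],
}
chunks = chunk_record(record, max_chars=80)
```

Carrying `source_id` on every chunk is what later lets an answer be traced back to the records that produced it.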
The lessons that generalize: invest in retrieval quality before prompt cleverness; measure groundedness with a small evaluation set from day one; and keep a human-readable trace of which sources fed each answer. Those three habits separate a demo that impresses in a meeting from a system people trust in daily work.
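A groundedness check and a source trace do not need heavy tooling to get started. The sketch below is an assumption-laden illustration: `groundedness` is a crude token-overlap proxy rather than an established metric, and the `trace_line` format and sample data are hypothetical. A real evaluation would likely combine human review or a model-based judge with something like this.

```python
import json
import re


def tokens(text):
    """Return the set of lowercase alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def groundedness(answer, contexts):
    """Fraction of answer tokens that also appear in the retrieved contexts.

    A rough proxy (assumption): a low score flags answers that introduce
    words the sources never mention.
    """
    answer_tokens = tokens(answer)
    if not answer_tokens:
        return 0.0
    context_tokens = set().union(*(tokens(c) for c in contexts))
    return len(answer_tokens & context_tokens) / len(answer_tokens)


def trace_line(question, answer, sources):
    """Serialize one human-readable trace record as a JSON line."""
    contexts = [source["text"] for source in sources]
    return json.dumps({
        "question": question,
        "answer": answer,
        "source_ids": [source["id"] for source in sources],
        "groundedness": round(groundedness(answer, contexts), 2),
    })


# Hypothetical evaluation data, for illustration only.
sources = [{"id": "c-001", "text": "Acme Corp has security rating B."}]
grounded_score = groundedness("Acme Corp has security rating B.", [sources[0]["text"]])
ungrounded_score = groundedness("Acme Corp was breached in 2019.", [sources[0]["text"]])
line = trace_line("What is Acme's rating?", "Acme Corp has security rating B.", sources)
```

Running a few dozen such cases on every change gives an early warning when a prompt or retrieval tweak starts letting unsupported claims through, and the JSON lines double as the audit trail of which sources fed each answer.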
If you are weighing a RAG project, start with a narrow, high-value question your team asks constantly. A focused assistant that nails one workflow earns the right to expand.