AI Engineering Insights

Field notes from building production AI - the patterns, trade-offs, and lessons behind the systems we ship.

Shipping a RAG Assistant Over 12M Records in Six Weeks

6 min read

RAG
LLM
LangChain
FastAPI

Retrieval-augmented generation (RAG) is often the fastest path from "we have a lot of data" to "anyone can ask questions about it in plain English." The pattern is simple to describe and easy to get wrong: retrieve the most relevant context for a question, hand it to an LLM with tight instructions, and return a grounded answer instead of a hallucination.

When we built a conversational interface over security ratings for more than 12 million companies, the hard parts were not the model calls. They were chunking and indexing the data so retrieval surfaced the right records, constraining the prompt so the model answered only from retrieved context, and serving it all behind a FastAPI layer fast enough to feel conversational. We deployed on AWS Kubernetes and reached production in six weeks.
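
To make the retrieve-then-constrain step concrete, here is a minimal sketch in plain Python. It is not the production code: the `Record` type, the word-overlap `retrieve` function (a toy stand-in for a real vector index), and the prompt wording are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    record_id: str  # hypothetical identifier for one indexed chunk
    text: str


def retrieve(question: str, records: list[Record], k: int = 2) -> list[Record]:
    """Toy retriever: rank records by words shared with the question.

    A real system would query a vector index; this only illustrates the shape.
    """
    q_words = set(question.lower().split())
    scored = [(len(q_words & set(r.text.lower().split())), r) for r in records]
    # Drop records with no overlap, then keep the k best matches.
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
    return [record for _, record in ranked[:k]]


def build_prompt(question: str, context: list[Record]) -> str:
    """Build a prompt that restricts the model to the retrieved sources."""
    sources = "\n".join(f"[{r.record_id}] {r.text}" for r in context)
    return (
        "Answer using only the sources below. Cite source ids in brackets. "
        "If the sources do not contain the answer, say you do not know.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )
```

The point of the sketch is the contract between the two functions: whatever the retriever returns is the only material the model is allowed to use, and every source carries an id that can be cited back.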

The lessons that generalize: invest in retrieval quality before prompt cleverness; measure groundedness with a small evaluation set from day one; and keep a human-readable trace of which sources fed each answer. Those three habits separate a demo that impresses in a meeting from a system people trust in daily work.
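
A small evaluation set and a source trace can start out very simple. The sketch below assumes answers cite their sources in brackets (for example `[c1]`) and treats an answer as grounded when every citation points at a record that was actually retrieved. The `AnswerTrace` type and this citation check are illustrative assumptions, and the check is only a cheap proxy: it does not verify that the cited text supports the claim.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class AnswerTrace:
    """Human-readable record of which sources fed one answer."""
    question: str
    answer: str
    retrieved_ids: tuple[str, ...]


def cited_ids(answer: str) -> set[str]:
    """Extract bracketed source ids such as [c1] from an answer."""
    return set(re.findall(r"\[([^\[\]]+)\]", answer))


def is_grounded(trace: AnswerTrace) -> bool:
    """True when the answer cites at least one source and only retrieved ones."""
    cited = cited_ids(trace.answer)
    return bool(cited) and cited <= set(trace.retrieved_ids)


def groundedness_rate(traces: list[AnswerTrace]) -> float:
    """Share of answers in the evaluation set that pass the citation check."""
    if not traces:
        return 0.0
    return sum(is_grounded(t) for t in traces) / len(traces)
```

Tracking this rate over a few dozen representative questions is enough to notice when a retrieval or prompt change makes answers drift away from their sources.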

If you are weighing a RAG project, start with a narrow, high-value question your team asks constantly. A focused assistant that nails one workflow earns the right to expand.

Predictive ML in Production: Lessons From Credit Decisioning

7 min read

Machine Learning
PySpark
MLOps

A model that scores well in a notebook and a model that makes thousands of real decisions per second are different engineering problems. Production ML lives or dies on the data pipeline around the model, not just the model itself.

Building a credit decisioning engine that served thousands of financial institutions meant treating feature engineering as a first-class, large-scale system. We used PySpark to compute features consistently across training and serving, so the values a model saw in production matched the ones it was trained on. The neural-network models were the visible part; the single, shared feature pipeline was what made decisions consistent and auditable.
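
The idea of one feature definition for both paths can be shown without Spark. The sketch below uses plain Python to stay self-contained; the field names (`monthly_income`, `monthly_debt`, `account_opened`) and the two features are hypothetical, not the ones from the actual engine.

```python
from datetime import date


def compute_features(applicant: dict, as_of: date) -> dict:
    """Single feature definition called by both the training job and the scorer."""
    income = applicant["monthly_income"]
    debt = applicant["monthly_debt"]
    return {
        # Guard against division by zero for applicants with no reported income.
        "debt_to_income": round(debt / income, 4) if income > 0 else None,
        # Age is measured against an explicit date so replays give the same value.
        "account_age_days": (as_of - applicant["account_opened"]).days,
    }


applicant = {
    "monthly_income": 4000.0,
    "monthly_debt": 1000.0,
    "account_opened": date(2020, 1, 1),
}
training_row = compute_features(applicant, as_of=date(2024, 1, 1))
serving_row = compute_features(applicant, as_of=date(2024, 1, 1))
```

Passing `as_of` explicitly, rather than reading the clock inside the function, is one way to keep a feature reproducible when a past decision is recomputed.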

Two requirements shaped every choice: explainability and reproducibility. Regulated decisions must be explainable after the fact, and a decision you cannot reproduce is a decision you cannot defend. We versioned data, features, and models together so any score could be traced back to its inputs.
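
One way to picture "versioned together" is a decision record that stores the data snapshot, feature version, model version, inputs, and score side by side, plus a fingerprint over all of them. The `DecisionRecord` type and its field names below are assumptions for illustration, not the schema we used.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DecisionRecord:
    data_snapshot: str    # hypothetical id of the input data version
    feature_version: str  # version of the feature code
    model_version: str    # version of the trained model
    features: dict        # exact inputs the model saw
    score: float

    def fingerprint(self) -> str:
        """Stable SHA-256 hash over every field; key order does not matter."""
        payload = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

If a recomputed decision produces a different fingerprint, something in the data, features, or model changed, and the record tells you which versions to compare.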

The takeaway for teams adding ML to high-stakes workflows: design the feature and monitoring layers first. The model is the easy part to swap; the pipeline is what you live with.

From 5 Minutes to 30 Seconds: Real-Time Computer Vision at City Scale

5 min read

Computer Vision
OpenCV
TensorFlow

Latency is the whole game in real-time computer vision. A model that detects a fire accurately but reports it five minutes later has failed at the only thing that mattered. Processing live video across hundreds of city cameras forced us to optimize the entire path, not just inference.

We built detection and license-plate recognition pipelines on OpenCV with TensorFlow and Keras, then cut fire-detection latency from five minutes to thirty seconds - a tenfold reduction. The gains came from a deliberate frame-sampling strategy, parallelizing across camera streams, and pushing lightweight pre-filtering close to the source so the heavier models only ran on frames worth analyzing.
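
Frame sampling and pre-filtering can be sketched in a few lines. The example below is a simplification under stated assumptions: frames are flattened lists of grayscale pixel values rather than real video frames, and the pre-filter is a plain mean-absolute-difference change detector with an arbitrary threshold. The actual filter we ran is not described here.

```python
from typing import Iterable, Iterator

Frame = list[int]  # flattened grayscale pixels in the range 0-255


def sample_frames(frames: Iterable[Frame], every_n: int) -> Iterator[Frame]:
    """Keep one frame out of every `every_n`, starting with the first."""
    for index, frame in enumerate(frames):
        if index % every_n == 0:
            yield frame


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average per-pixel change between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def frames_worth_analyzing(
    frames: Iterable[Frame], every_n: int = 5, threshold: float = 10.0
) -> Iterator[Frame]:
    """Yield sampled frames that changed enough since the previous sample."""
    previous = None
    for frame in sample_frames(frames, every_n):
        # The first sampled frame always passes; later ones must show change.
        if previous is None or mean_abs_diff(frame, previous) >= threshold:
            yield frame
        previous = frame
```

Only the frames this generator yields would be handed to the heavier detection model, which is where the saved inference time comes from.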

The principle that transfers to any live-video system: decide your latency budget first, then design backward from it. Every stage - capture, decode, pre-filter, infer, alert - spends part of that budget, and the cheapest stage to optimize is usually the work you avoid doing at all.
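
Designing backward from a budget is easier when the budget is written down as data. This sketch sums hypothetical per-stage timings and compares them with a thirty-second target; the stage numbers are invented for illustration and are not measurements from our system.

```python
def check_budget(stage_seconds: dict[str, float], budget_seconds: float) -> dict:
    """Compare the summed stage latencies against an end-to-end budget."""
    total = sum(stage_seconds.values())
    return {
        "total": total,
        "headroom": budget_seconds - total,
        "within_budget": total <= budget_seconds,
        # The slowest stage is usually the first place to look for savings.
        "slowest_stage": max(stage_seconds, key=stage_seconds.get),
    }


# Hypothetical timings in seconds for each stage of the pipeline.
stages = {"capture": 2.0, "decode": 3.0, "pre_filter": 1.0, "infer": 15.0, "alert": 4.0}
report = check_budget(stages, budget_seconds=30.0)
```

Running a check like this in monitoring makes it obvious which stage is spending the budget when end-to-end latency creeps up.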

When live video has to drive an instant decision, accuracy and speed are not a trade-off you accept once. They are a balance you engineer at every layer.
