(E26) Building Production-Grade AI: Nikolai Grabner on Testing RAGs, LLMs, and the QA Mindset
In this episode of Developers Who Test, host Chris Harbert sits down with Nikolai Grabner, a Senior Software Engineer and Technical Lead at Enigma Solutions, to talk about what it actually takes to build and test AI systems that are ready for production. Nikolai opens by demystifying retrieval-augmented generation (RAG), using the analogy of a knowledgeable judge who consults a specialist library (a vector database) when a question falls outside their general expertise. He explains why companies are increasingly building private, in-network RAG systems: to keep proprietary information out of models hosted by third-party providers like OpenAI and Anthropic while still giving employees a single, instant point of access for things like HR policy questions and onboarding knowledge.
Nikolai shares the origin story behind his own product, SAP Bot, which grew out of market research he did when founding Enigma Solutions. After hearing that many internal RAG systems were, in his words, not working properly, his QA instincts kicked in and he set out to prove a thoroughly tested private RAG could get close to the quality of the big public models. A central theme of the conversation is how testing AI differs fundamentally from traditional pass/fail testing. Because the same prompt can return different answers each time, Nikolai built a scoring mechanism rooted in statistics, precision, and coverage to detect hallucination (making things up) and drift (staying on topic but giving wrong answers). Chris draws a parallel to Six Sigma and the idea of variability as the enemy of quality.
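The episode summary does not spell out how Nikolai's scoring mechanism works, so the following is only a minimal sketch of the general idea: run the same prompt several times, score each answer, and judge the aggregate instead of a single pass/fail result. The function names, the set-based definitions of precision and coverage, and the thresholds are all assumptions made for illustration, not his implementation.

```python
from statistics import mean, pstdev

def score_response(claims, required, supported):
    """Score one answer, represented as a set of claim strings.

    Assumed definitions (hypothetical, not from the episode):
    precision = share of the answer's claims that the source documents support
    coverage  = share of the required facts that the answer mentions
    """
    claims = set(claims)
    precision = len(claims & supported) / len(claims) if claims else 0.0
    coverage = len(claims & required) / len(required) if required else 1.0
    return precision, coverage

def evaluate_runs(runs, required, supported, min_precision=0.9, min_coverage=0.8):
    """Aggregate scores over repeated runs of the same prompt."""
    scores = [score_response(run, required, supported) for run in runs]
    precisions = [p for p, _ in scores]
    coverages = [c for _, c in scores]
    summary = {
        "mean_precision": mean(precisions),
        "mean_coverage": mean(coverages),
        # Spread across runs: high variability is itself a warning sign.
        "precision_spread": pstdev(precisions),
        "coverage_spread": pstdev(coverages),
    }
    # Thresholds are illustrative; a real suite would tune them per prompt.
    summary["passed"] = (
        summary["mean_precision"] >= min_precision
        and summary["mean_coverage"] >= min_coverage
    )
    return summary
```

In this sketch, low precision points toward hallucination (unsupported claims), while low coverage with high precision points toward answers that are accurate but miss the required facts.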
The two get into the practical realities of building with AI, including using tools like promptfoo to fire the same set of prompts at OpenAI, Anthropic, and Gemini and compare results, tuning the temperature for creativity, and learning hard lessons about performance. Nikolai recounts how discovering CUDA and offloading the LLM to the GPU cut his response times from five minutes down to about fifteen seconds with streamed output. They also swap cautionary tales about AI getting things subtly wrong: a coding tool that inverted passes and fails in test results, and an MCP server with start and end dates reversed that the LLM quietly worked around, leaving a hidden landmine in the system.
Much of the discussion centers on discipline. Nikolai argues that vibe coding can produce production-grade software, but only with clear requirements, a design spec, a roadmap, phased delivery, and regression testing after every change. He compares vibe coding to managing a junior dev team that still needs its work tested. Chris highlights how far modern tooling has come, pointing to Playwright MCP and the Testery MCP for running reliable end-to-end tests at scale and feeding results back to the LLM, and Nikolai contrasts that with the weeks it once took to script a single test in WinRunner back in 1999.
The episode closes on continuous quality and staying current. Nikolai makes the case for always-on testing of AI systems (since a single faulty document can skew an entire RAG), dedicated research and development teams, and giving testers room to run proofs of concept on test infrastructure. Both reflect on how quickly organizations can become dinosaurs in the AI era, the value of conferences like TestCon for learning what is genuinely cutting edge, and how that spirit of learning by doing is exactly why the podcast exists.
Key Topics:
- What retrieval-augmented generation (RAG) is and how it extends an LLM with specialized knowledge
- Why companies build private, in-network RAGs with vector databases to protect proprietary data
- The origin of SAP Bot and applying a QA mindset to building AI products
- Testing LLMs statistically: scoring for precision, coverage, hallucination, and drift
- Using promptfoo to run prompts across OpenAI, Anthropic, and Gemini for multi-model comparison
- Performance gains from CUDA and GPU offloading, plus response streaming
- Real-world AI failure stories: inverted pass/fail results and reversed start/end dates in an MCP server
- Turning vibe-coded AI into production-grade systems through discipline, specs, and regression testing
- Modern testing tooling (Playwright MCP, Testery MCP) versus legacy tools like WinRunner
- Continuous AI testing, dedicated R&D teams, proofs of concept, and learning from conferences