Most AI prototypes work brilliantly in demos. They impress investors, excite stakeholders, and generate genuine enthusiasm about what's possible. Then reality hits: the system fails unpredictably in production, costs spiral out of control, and the team discovers that their clever architectural decisions have painted them into a corner.
I recently completed a four-week consulting engagement with a data analytics startup that found itself in exactly this position. They had built a conversational AI platform designed to transform natural language queries into actionable insights from spreadsheet data: a genuinely valuable product in a market crying out for better solutions. The problem wasn't their vision. It was that their prototype, after months of development, was completing fewer than 60% of queries successfully.
What followed was not merely a technical rescue mission. It was an evidence-based transformation that delivered a 98% success rate, comprehensive knowledge transfer, and strategic guidance that extended well beyond the codebase. Here's what that actually looked like.
The founders had invested significant time building a three-agent AI system using a custom domain-specific language (DSL) for data transformations. The approach made theoretical sense: translate natural language queries into structured workflows, then execute those workflows against user data.
In practice, the system was struggling. The DSL was brittle: approximately 40% of queries failed due to malformed outputs, schema validation errors, or runtime exceptions. Simple queries were consuming 20,000 to 50,000 tokens, creating cost structures that would be unsustainable at scale. Every new transformation type required updates across multiple system components, creating a maintenance burden that was slowing development velocity.
The business impact was clear: the team couldn't confidently demo their product to potential customers or investors. They knew they needed to change direction but were uncertain which path to take—and wary of investing more months into another approach that might fail.
My initial scope was to refactor their existing agents through prompt engineering improvements and explore alternatives. What actually happened was more fundamental.
There's a temptation in consulting to demonstrate immediate value by diving straight into changes. I've learned this is usually a mistake. Week one was deliberately focused on understanding before doing.
I reviewed the codebase systematically, mapped the core components, and asked extensive clarifying questions. What seemed like slow progress was actually risk mitigation: understanding the existing landscape meant I could identify not just what was broken, but why, and which approaches were likely to succeed.
During this phase, I also established foundational improvements: modern dependency management, tightened linting configuration, continuous integration via GitHub Actions, and proper documentation. These might appear peripheral to the core problem, but they set the standard for quality and made everything that followed easier to test, deploy, and maintain.
The key insight from week one: the fundamental challenge wasn't the prompts. It was the DSL itself. Trying to fix the existing approach would deliver diminishing returns. But I needed evidence before recommending a major architectural pivot.
Here's where the engagement diverged from typical consulting. Rather than relying on intuition or anecdotal testing to guide a major decision, I built a comprehensive evaluation suite: 35 test cases spanning six business domains, including building management, supply chain logistics, electric vehicle sales, emissions tracking, Formula 1 statistics, and social media analytics.
This wasn't an academic exercise. The evaluation suite enabled a rigorous head-to-head comparison: their existing DSL approach versus direct Python code generation. Each test case ran through a complete pipeline: query interpretation, code or workflow generation, execution, and result validation using an AI judge to assess correctness and completeness.
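To make the shape of such a harness concrete, here is a minimal sketch of what one of these end-to-end evaluation runs might look like. Everything here is illustrative rather than the client's actual code: `EvalCase`, `run_pipeline`, and `judge` are hypothetical names standing in for the real pipeline and the AI judge.

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    domain: str          # e.g. "supply_chain" or "f1_statistics"
    query: str           # the natural language question
    dataset_path: str    # spreadsheet fixture the query runs against
    rubric: str          # what a correct, complete answer must contain

def evaluate(cases, run_pipeline, judge):
    """Run each case end-to-end and return the overall success rate."""
    passed = 0
    for case in cases:
        try:
            result = run_pipeline(case.query, case.dataset_path)
        except Exception:
            continue  # crashes count against the approach under test
        # AI-as-a-judge: a separate model scores the result against the rubric
        if judge(case.query, case.rubric, result):
            passed += 1
    return passed / len(cases)
```

The point of this structure is that the same cases can be fed through both `run_pipeline` implementations, DSL and code generation, which is what made the comparison apples-to-apples.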
The findings were unambiguous. Code generation achieved a 98% success rate versus the DSL's sub-60% completion. More importantly, the code generation approach never produced syntax errors (a validation gate caught all issues before execution), while the DSL approach frequently crashed with exceptions.
This evidence transformed the strategic conversation. Instead of debating opinions about which approach might work better, we had data. The founders could make an informed decision to pivot, not based on my say-so, but on measurable proof that justified the investment in architectural change.
The de-risking principle here matters for any executive overseeing AI initiatives: demand evidence before major pivots. Evaluation suites aren't just for engineers. They're risk mitigation instruments that protect against expensive wrong turns.
With clear evidence supporting the pivot, week three involved building a production-ready code generation pipeline from scratch. This wasn't incremental improvement; it was architectural replacement.
The new system comprised four specialised AI agents working in sequence: a planner to interpret queries and generate execution plans, a code generator to produce Python data analysis code, a result verifier using AI-as-a-judge methodology to validate outputs, and a summariser to transform results into natural language responses.
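A minimal sketch of that sequential orchestration might look like the following. The agent callables and their signatures are assumptions for illustration, not the system's actual interfaces; retry handling is shown separately below.

```python
from typing import Callable

def answer_query(
    query: str,
    dataset_path: str,
    planner: Callable,         # natural language query + data -> execution plan
    code_generator: Callable,  # execution plan -> Python analysis code
    execute: Callable,         # code + data -> raw result
    verifier: Callable,        # query + result -> (ok, reason), AI-as-a-judge
    summariser: Callable,      # query + result -> natural language answer
) -> str:
    """One clean pass through the four-agent sequence.

    Failure handling is omitted here; in practice a failed verification
    feeds back into the retry loop sketched next.
    """
    plan = planner(query, dataset_path)
    code = code_generator(plan)
    result = execute(code, dataset_path)
    ok, reason = verifier(query, result)
    if not ok:
        raise RuntimeError(f"Verification failed: {reason}")
    return summariser(query, result)
```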
Critically, the pipeline incorporated self-healing retry logic. When code compilation failed or results didn't match query intent, the system automatically fed error context back to the code generator with up to three retry attempts. This resilience meant the system could correct its own mistakes without human intervention.
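Here is a rough sketch of how such a self-healing loop can be structured, with Python's built-in `compile()` standing in as the validation gate. The function names and the `(ok, reason)` verifier contract are illustrative assumptions; the real system's interfaces may differ.

```python
MAX_ATTEMPTS = 3  # mirrors the "up to three retry attempts" described above

def generate_with_retries(plan, code_generator, execute, verifier, query):
    """Regenerate code with error feedback until the result passes checks."""
    feedback = None
    for _ in range(MAX_ATTEMPTS):
        code = code_generator(plan, feedback=feedback)
        try:
            # Validation gate: reject code before it ever executes,
            # so no syntax error can reach the execution stage.
            compile(code, "<generated>", "exec")
        except SyntaxError as err:
            feedback = f"Syntax error: {err}"
            continue
        try:
            result = execute(code)
        except Exception as err:
            feedback = f"Runtime error: {err}"
            continue
        ok, reason = verifier(query, result)  # AI-as-a-judge check
        if ok:
            return result
        feedback = f"Result did not match query intent: {reason}"
    raise RuntimeError(f"Gave up after {MAX_ATTEMPTS} attempts: {feedback}")
```

Because the previous attempt's error context rides along in `feedback`, each regeneration is informed rather than blind, which is what lets the system correct its own mistakes without a human in the loop.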
One decision during this phase illustrates a broader principle worth highlighting. I initially selected OpenAI's most advanced reasoning model for the planning agent, attracted by its sophisticated capabilities. In testing, it proved unsuitable: it took over 60 seconds to answer simple queries and produced intermittent reliability issues.
I switched to a less advanced but faster model. Response times dropped to under 10 seconds for planning tasks. The system became genuinely usable.
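In configuration terms, this amounts to choosing models per agent rather than globally. The sketch below uses deliberate placeholder model names (specific models date quickly); the structure, not the particular choices, is the point.

```python
# Illustrative per-agent model selection: the planner, verifier, and
# summariser favour speed and reliability; only the code generator
# gets a heavier model. Model names here are placeholders.
AGENT_MODELS = {
    "planner":         {"model": "fast-general-model",    "timeout_s": 10},
    "code_generator":  {"model": "stronger-coding-model", "timeout_s": 30},
    "result_verifier": {"model": "fast-general-model",    "timeout_s": 10},
    "summariser":      {"model": "fast-general-model",    "timeout_s": 10},
}
```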
The lesson: the newest, most powerful technology isn't always the right choice. The planning task didn't require maximum intelligence; it needed reliability and speed.
The final week focused on completing the integration and ensuring the founders could maintain, extend, and improve the system independently.
The complete pipeline achieved a 98% completion rate across the evaluation suite. Response times dropped to 20-30 seconds for most queries (including retries when needed). The codebase included 79 passing tests: 58 unit tests and 21 integration tests, plus the full 35-case evaluation suite spanning all six domains.
Beyond the technical deliverables, I produced comprehensive handover documentation covering architecture patterns, agent design decisions, testing strategies, and configuration options. Good consulting means going beyond building things. It means ensuring competent knowledge transfer so the client isn't dependent on you after you leave.
What distinguished this engagement from pure technical consulting was the strategic guidance woven throughout our weekly discussions.
Go-to-Market Positioning: Through our conversations, we identified that the financial operations market, specifically fractional CFOs and financial operations teams, presented a compelling opportunity. These professionals work across fragmented data ecosystems (client spreadsheets, accounting exports, bank CSVs), making a spreadsheet-native AI solution particularly valuable. I provided competitive landscape analysis comparing their positioning against Excel, Tableau, Power BI, and emerging AI tools like Microsoft Copilot.
I advised they consider vertical specialisation in financial operations rather than horizontal general-purpose positioning. Generic tools will struggle to match domain-specific expertise, and a clear value proposition accelerates market traction.
Product Evolution Roadmap: I documented nine prioritised areas for future development, including security-critical items like code execution sandboxing, scalability improvements for metadata handling, and technology strategy recommendations. Each area included specific suggestions, rationale, and priority assessment.
Technology Strategy: I advised on model selection principles, such as when to invest in expensive, sophisticated models versus cost-effective alternatives. I recommended deferring major front-end investment until product-market fit was validated, and offered to connect them with a developer who could build a customer-facing interface at reasonable cost when the time was right.
This strategic layer matters because technical excellence without business context is incomplete. A production-ready system that targets the wrong market or lacks a defensible position is still a failure.
The quantitative outcomes tell a clear story:
- Success rate: Improved from less than 60% to 98%
- Response times: Reduced from 60+ seconds to 20-30 seconds
- Test coverage: 79 passing tests plus comprehensive evaluation suite
- System reliability: Zero syntax errors reaching execution (self-healing validation)
- Maintainability: Full documentation and knowledge transfer completed
But the qualitative transformation matters equally. The founders moved from a position of uncertainty—unsure whether to continue investing in a struggling approach or pivot without evidence—to confidence. They now have a production-ready system, clear strategic direction, and the capability to iterate independently.
If you're overseeing AI initiatives or considering engaging external AI expertise, these principles apply broadly:
Demand evidence before major pivots. Evaluation suites aren't optional extras—they're the difference between informed decisions and expensive guesses. When someone recommends architectural change, ask what evidence supports that recommendation.
First-week investment pays dividends. Consultants who rush to demonstrate immediate value often create technical debt that compounds later. Understanding before action prevents costly mistakes and builds the foundation for confident execution.
Pragmatism over hype. The newest model, the most sophisticated approach, the cutting-edge technique—these aren't inherently superior. What matters is fitness for purpose. Sometimes the "older" technology is the right call.
Consulting should include strategy. Technical delivery without business context is incomplete. If your AI consultant isn't asking about your market positioning, competitive landscape, and go-to-market approach, you're getting half the value.
Knowledge transfer is non-negotiable. Can your team maintain the system after the consultant leaves? If not, you've bought a dependency, not a solution.
This engagement transformed a struggling AI prototype into a production-ready system with a 98% success rate, comprehensive test coverage, and a clear strategic direction. The investment was four weeks of focused work; the return was a platform the founders can confidently take to market.
The methodology that enabled this transformation—evidence-based decision making, foundations before action, pragmatic technology selection, and strategic advisory alongside technical delivery—applies to AI initiatives at any scale. Whether you're leading an early-stage startup or an enterprise AI transformation, these principles determine whether your investment translates into genuine business value.
If you're facing similar challenges, whether that's an AI system that works in demos but struggles in production, uncertainty about which technical direction to pursue, or the need for strategic clarity alongside technical execution, I'd welcome a conversation about how this approach might apply to your situation.
