September 24, 2025 Rui Mendes

Our Seed-Stage AI Product Diligence Checklist

I want to be upfront about what this is: a working document, not a definitive framework. We update our technical diligence questions regularly as the AI MarTech landscape evolves and as we learn from investments that worked and ones that didn't. What follows is the current version of the core questions I use in technical conversations with AI MarTech founders at seed stage — not a comprehensive review checklist but a set of targeted questions where the answers are most predictive of the underlying technical quality of the product.

Seed-stage technical diligence is different from Series A diligence in important ways. At seed, we often don't have deployed production systems to review. The code is early, the architecture is partially defined, and the founder's mental model of how the system works matters as much as what exists. The questions below are designed to surface the founder's technical judgment, not to audit a finished system.

On model architecture and dependency

"Walk me through how your system handles an input that's outside the distribution you trained or prompted for."

This question separates founders who have thought about robustness from founders who have thought about demo performance. A product that works beautifully on clean, well-structured inputs and falls apart on messy, real-world inputs is a demo, not a product. The answer I'm looking for has some form of: we detect out-of-distribution inputs, we route them differently, we degrade gracefully, we surface them for human review. The answer that concerns me: "our model is good enough that this hasn't been a problem."
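To make the "detect, route, degrade, review" shape concrete, here is a minimal Python sketch. It is illustrative only: the token-overlap score is a crude stand-in for whatever distance measure a real system would use (embedding similarity, a classifier, a perplexity check), and the names `route_input` and `Route` and both thresholds are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Route:
    handler: str  # "model", "fallback_template", or "human_review"
    score: float  # similarity to the closest known-good input


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a crude stand-in for an embedding distance."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def route_input(text: str, reference_inputs: list[str],
                in_dist: float = 0.5, borderline: float = 0.2) -> Route:
    """Pick a handler based on how close the input is to inputs we have tested."""
    score = max((jaccard(text, ref) for ref in reference_inputs), default=0.0)
    if score >= in_dist:
        return Route("model", score)              # looks like what we built for
    if score >= borderline:
        return Route("fallback_template", score)  # degrade to a simpler output
    return Route("human_review", score)           # surface it to a person


refs = ["write a subject line for a spring sale email"]
print(route_input("write a subject line for our webinar invite", refs))
print(route_input("asdf qwer zxcv", refs))
```

What I care about is not the scoring method but the fact that there is an explicit branch for inputs the system was not built for.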

"What's your model dependency stack, and what breaks if [specific upstream provider] changes their API or pricing?"

Many seed-stage AI products have significant dependencies on one or two foundation model providers. This isn't necessarily a problem — it's often the right architectural choice at early stage. But founders who can't articulate the dependency clearly, or who haven't thought about what happens when the dependency changes, are operating without a reasonable risk model. I'm looking for founders who have a clear-eyed view of their current dependencies and a thesis about when and how they'll reduce the ones that matter most.
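One way a founder might show they have thought about this is a thin interface between product code and the provider, so that an API change touches one adapter rather than the whole codebase. The sketch below is an assumption about what that could look like, not a recommendation of a specific design; `TextModel`, `PrimaryProvider`, and `BackupProvider` are hypothetical stand-ins with no real provider SDK behind them.

```python
from __future__ import annotations

from typing import Protocol


class TextModel(Protocol):
    """The only surface the rest of the product is allowed to depend on."""
    name: str

    def complete(self, prompt: str) -> str: ...


class PrimaryProvider:
    name = "primary"

    def complete(self, prompt: str) -> str:
        # Simulates an outage or a breaking API change upstream.
        raise ConnectionError("provider unavailable")


class BackupProvider:
    name = "backup"

    def complete(self, prompt: str) -> str:
        return f"[backup] {prompt}"


def complete_with_failover(prompt: str, providers: list[TextModel]) -> tuple[str, str]:
    """Try providers in order; return (provider name, output) from the first that works."""
    errors = []
    for provider in providers:
        try:
            return provider.name, provider.complete(prompt)
        except Exception as exc:
            errors.append(f"{provider.name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


print(complete_with_failover("hello", [PrimaryProvider(), BackupProvider()]))
```

At seed stage, the backup provider may not exist yet. The point is that the seam where it would plug in is identifiable.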

On evaluation and quality measurement

"Show me your eval suite. What does it cover, how do you run it, and what does a regression look like?"

This is the question that most cleanly separates founders with engineering discipline from founders who only have good intuitions. Every AI product has some level of output quality; the question is whether and how it is measured. At minimum, an eval suite covers: a set of representative inputs with expected output characteristics; automated checks that flag when outputs fall outside acceptable parameters; human evaluation samples on a regular cadence; and regression detection between model versions or prompt changes.
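The automated-check and regression pieces of that list can be very small. The following Python sketch assumes each case is a prompt plus a pass/fail check on the output; `EvalCase`, `run_suite`, and `regressions` are hypothetical names, and the two "model versions" are toy functions rather than real model calls.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], bool]  # True if the output is acceptable


def run_suite(generate: Callable[[str], str], cases: list[EvalCase]) -> dict[str, bool]:
    """Run every case through one model/prompt version and record pass/fail."""
    return {case.name: bool(case.check(generate(case.prompt))) for case in cases}


def regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Cases that passed on the baseline version but fail on the candidate."""
    return sorted(name for name, ok in baseline.items()
                  if ok and not candidate.get(name, False))


cases = [
    EvalCase("subject_length", "subject line for spring sale", lambda out: len(out) <= 10),
    EvalCase("mentions_brand", "subject line for spring sale", lambda out: "Acme" in out),
]
v1 = lambda prompt: "Acme sale"              # stand-in for the current version
v2 = lambda prompt: "Big sale today only!!"  # stand-in for a changed prompt or model

print(regressions(run_suite(v1, cases), run_suite(v2, cases)))
```

A regression, in this framing, is any case that used to pass and no longer does. Human evaluation samples sit alongside this rather than inside it.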

The absence of any formal eval suite at seed stage is a yellow flag, not a red flag — some products are genuinely too early for formalized eval. But the founder should be able to describe how they currently know when quality has degraded, and what their plan is for formalizing evaluation as the product matures. Founders who have no answer to this question — who rely entirely on customer complaints as their quality signal — are flying blind in a way that will cause problems at scale.

"How do you measure whether your personalization or content generation is actually performing better than a simpler baseline?"

This question probes whether the founder has thought rigorously about the value-add of their AI component. Many AI MarTech products have AI that is undeniably more sophisticated than a rule-based baseline — but sophistication and performance are different things. The founder should be able to describe a measurement framework that distinguishes the performance contribution of their AI from what a simpler system would produce. Without this, there's no systematic way to know whether the AI is adding value or just adding cost and complexity.
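As an illustration of what a measurement framework can reduce to, the sketch below compares conversion rates for an AI variant against a simpler baseline using a two-proportion z-test. The choice of test is my assumption (one common option among several), and the counts are made up for the example.

```python
import math


def lift_vs_baseline(ai_conversions: int, ai_n: int,
                     base_conversions: int, base_n: int) -> dict:
    """Relative lift of the AI variant over the baseline, with a two-sided p-value."""
    p_ai = ai_conversions / ai_n
    p_base = base_conversions / base_n
    pooled = (ai_conversions + base_conversions) / (ai_n + base_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / ai_n + 1 / base_n))
    z = (p_ai - p_base) / se if se > 0 else 0.0
    # Two-sided p-value under the normal approximation: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    relative_lift = (p_ai - p_base) / p_base if p_base > 0 else None
    return {"relative_lift": relative_lift, "z": z, "p_value": p_value}


# Hypothetical counts: 120/1000 conversions with AI vs 100/1000 with rules.
print(lift_vs_baseline(120, 1000, 100, 1000))
```

With these illustrative numbers the AI variant shows a 20% relative lift that is not statistically distinguishable from the baseline at conventional thresholds, which is exactly the gap between sophistication and performance.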

On data and the flywheel

"Describe how your product gets better over time as it processes more customer data. Be specific about what data, what signal, and what gets updated."

This is the data flywheel question, and the quality of the answer is one of the strongest signals I use. A strong answer has: specific data types collected during product usage (performance signals, user feedback, behavioral patterns); a clear mechanism by which that data improves the model or generation parameters (fine-tuning cycles, reinforcement from human feedback, embedding updates); and a timeline for when a customer would start seeing the compounding benefit. A weak answer has: vague gestures toward "the system learns from usage" without a specific mechanism.

The strong answer doesn't have to be fully built at seed stage — it's fine if it's planned rather than shipped. What matters is whether the founder has a specific, coherent theory of how the data flywheel works for their product. Founders who haven't thought through this in detail are likely building a product that won't compound, which means the competitive dynamics will be driven by feature parity rather than data advantage.

On production readiness and failure modes

"What happens in your system when the model returns a low-quality output? Walk me through the failure path."

Production AI systems fail. The question is whether they fail gracefully or catastrophically. A product that's production-ready has specific failure modes mapped out: what constitutes a failure (quality below threshold, latency too high, model returns an error), what happens when a failure is detected (fallback to a simpler output, routing to human review, user notification, retry logic), and how failures are logged for later analysis. A product that hasn't mapped its failure modes is production-ready for demos, not for customers.
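A mapped failure path can be sketched in a few lines of Python. This is a sketch under stated assumptions: `call_model` and `score_quality` are hypothetical callables supplied by the product, the 0.7 threshold and two attempts are arbitrary, and latency handling is left out for brevity.

```python
import logging
from dataclasses import dataclass
from typing import Callable

logger = logging.getLogger("generation")


@dataclass
class Result:
    text: str
    source: str         # "model" or "fallback"
    needs_review: bool  # True when a human should look at this case


def generate_with_fallback(prompt: str,
                           call_model: Callable[[str], str],
                           score_quality: Callable[[str], float],
                           fallback_text: str,
                           min_quality: float = 0.7,
                           max_attempts: int = 2) -> Result:
    """Retry on errors or low quality, then fall back and flag for review."""
    for attempt in range(1, max_attempts + 1):
        try:
            output = call_model(prompt)
        except Exception as exc:  # model returned an error
            logger.warning("model error on attempt %d: %s", attempt, exc)
            continue
        quality = score_quality(output)
        if quality >= min_quality:
            return Result(output, "model", needs_review=False)
        logger.warning("quality %.2f below threshold on attempt %d", quality, attempt)
    # Every attempt failed: serve the simpler output and route to human review.
    return Result(fallback_text, "fallback", needs_review=True)


result = generate_with_fallback(
    "subject line for spring sale",
    call_model=lambda p: "sale",
    score_quality=lambda out: 0.1,  # pretend the scorer rejects this output
    fallback_text="Our spring sale is on",
)
print(result)
```

The three things I listen for all appear here: a definition of failure, an action on failure, and a log line for later analysis.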

"What's your inference cost at your target deployment volume, and what's the margin structure at that volume?"

This question surfaces whether the founder has done the unit economics math for their business model at scale. Many seed-stage AI products have attractive margins at low usage volumes that collapse at scale because the inference costs weren't modeled carefully. The founder should be able to give approximate numbers: cost per thousand API calls for their target model, expected call volume at contract value X, and what that implies for gross margin. Founders who haven't thought through this are selling a product whose economics they don't understand.
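The arithmetic behind that answer is simple enough to write down. In the Python sketch below every number is hypothetical and chosen only to show how margin moves with volume; `unit_economics` is an illustrative helper, and real models would add costs I have lumped into `other_cogs`.

```python
def unit_economics(contract_value: float, calls: float,
                   cost_per_1k_calls: float, other_cogs: float = 0.0) -> dict:
    """Monthly inference cost and gross margin for one customer contract."""
    if contract_value <= 0:
        raise ValueError("contract_value must be positive")
    inference_cost = calls / 1000 * cost_per_1k_calls
    gross_margin = (contract_value - inference_cost - other_cogs) / contract_value
    return {"inference_cost": inference_cost, "gross_margin": gross_margin}


# Hypothetical: a $2,000/month contract at $4 per thousand calls.
pilot = unit_economics(contract_value=2000, calls=50_000, cost_per_1k_calls=4.0)
scaled = unit_economics(contract_value=2000, calls=400_000, cost_per_1k_calls=4.0)
print(pilot)   # inference cost 200.0, gross margin about 0.90
print(scaled)  # inference cost 1600.0, gross margin about 0.20
```

Same contract, same price per call, eight times the usage: the margin goes from roughly 90% to roughly 20%. That is the collapse this question is designed to catch.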

What makes a conversation go well versus poorly

The conversations that give me high confidence in a technical founder share a few characteristics: they're comfortable saying "we haven't built that yet, but here's our plan"; they can describe failure modes without being defensive about them; they have opinions about technical tradeoffs rather than treating every choice as obvious; and they've been surprised by something in production that changed how they think about the architecture.

That last one, having been surprised, is actually a positive signal. It means they've operated a real system with real users under real conditions. Founders who have never been surprised by their production system usually haven't spent enough time in production. The surprises teach you what the demos don't.

The conversations that concern me are the ones where every answer is confident and complete, every question has an obvious answer, and there's no evidence of having encountered a hard problem. At seed stage, that usually means the product hasn't been deployed seriously enough to reveal the hard problems yet. Which means those problems are still ahead of them, and they are behind the founders who've already found and solved them.
