January 7, 2025 Rui Mendes

NLP Goes to Market: What We Learned from Our First Fund

My PhD was in computational linguistics. My dissertation was about cross-lingual transfer learning: getting a model trained on English to generalize reasonably well to Portuguese without bilingual training data. My committee was satisfied. Industry was mostly indifferent. The work was technically rigorous and commercially irrelevant. This is a common story in NLP research, and it shaped how I think about the gap between research-grade language models and production marketing applications. I've watched that gap closely since joining Telhaverde, and I now examine it in the technical diligence I run on every potential investment.

Fund I ran from 2021 through roughly 2023 in deployment terms. We backed companies building on top of language models at a time when "GPT-3 wrapper" was both an accusation and a fair description of many products on the market. Looking back, the companies from that period that are performing well share characteristics we didn't fully articulate as a thesis at the time — we were pattern-matching on founder quality and market intuition more than a rigorous technical framework. Fund II forced us to be more precise.

The failure modes we saw in Fund I

Several categories of technical failure appeared repeatedly among companies in the 2021–2023 AI MarTech cohort — not necessarily in our portfolio, but in the broader landscape we were evaluating and tracking.

The most common was what I'd call prompt brittleness. A founder would demonstrate a product that generated excellent marketing copy in controlled demos. The demos worked because the prompts were carefully engineered for specific input types. When customers tried to use the product with their own content (different product categories, different brand voices, different target markets), the quality degraded sharply and inconsistently. The problem wasn't the underlying model; it was that the product had no robust system for handling out-of-distribution inputs. It performed well in the demonstration environment and was fragile everywhere else.

The second failure mode was evaluation blindness. The founders building content generation tools often had no rigorous way to measure whether their outputs were good. They relied on human judgment ("this feels right") or indirect proxies like user engagement. Without a principled evaluation framework — what NLP researchers call an eval suite — they couldn't systematically improve their product. Every model update was a guess. Every customer complaint was a surprise. The discipline of evaluation that's standard in academic NLP wasn't being imported into product development.
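To make "eval suite" concrete, here is a minimal sketch in Python of what we mean. Every name in it (EvalCase, run_suite, is_regression, the two stub generators, the sample briefs) is hypothetical and chosen for illustration, and the pass criteria are deliberately crude: required terms present, copy within a word budget. A real suite would score brand voice and performance far more carefully. The point is only that each prompt or model change yields a number that can be compared against a baseline.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    name: str
    brief: str                 # input handed to the generator
    required_terms: List[str]  # terms the copy must mention (illustrative criterion)
    max_words: int             # length budget for the channel (illustrative criterion)


def score_case(case: EvalCase, output: str) -> bool:
    """Pass only if every required term appears and the copy fits the budget."""
    text = output.lower()
    has_terms = all(term.lower() in text for term in case.required_terms)
    within_budget = len(output.split()) <= case.max_words
    return has_terms and within_budget


def run_suite(generate: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Return the fraction of cases the generator passes."""
    if not cases:
        raise ValueError("eval suite is empty")
    passed = sum(score_case(case, generate(case.brief)) for case in cases)
    return passed / len(cases)


def is_regression(baseline: float, candidate: float, tolerance: float = 0.02) -> bool:
    """Flag a change whose pass rate falls more than `tolerance` below baseline."""
    return candidate < baseline - tolerance


# Hypothetical stand-ins for two prompt versions; a real product would call a model.
def old_prompt(brief: str) -> str:
    brand = brief.split("brand ")[-1]
    return f"Meet {brand}, made for every day."


def new_prompt(brief: str) -> str:
    return "Meet our newest product, made for every day."


CASES = [
    EvalCase("shoes", "Running shoes, brand Aro", ["Aro"], 12),
    EvalCase("tea", "Green tea, brand Lume", ["Lume"], 12),
]

baseline = run_suite(old_prompt, CASES)
candidate = run_suite(new_prompt, CASES)
if is_regression(baseline, candidate):
    print(f"regression: pass rate fell from {baseline:.2f} to {candidate:.2f}")
```

In this toy run the second prompt version drops the brand name, so the suite catches the change before a customer does. That is the whole discipline in miniature.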

Third was fine-tuning misconceptions. Many founders believed that fine-tuning a foundation model on their customer's domain data would automatically deliver the quality and brand consistency they were promising. Fine-tuning helps, but it doesn't close the gap between "generating text that sounds plausible" and "generating text that performs in a specific marketing context for a specific audience." The customers who expected fine-tuning to be a magic solution were consistently disappointed.

What changed between 2021 and now

The foundation models improved. But more importantly for our evaluation work, the tooling around production deployment matured significantly. By 2024, the interesting technical work in AI MarTech wasn't "can you build a language model" — it was "can you build the scaffolding around a language model that makes it trustworthy in production."

That scaffolding includes four components we now explicitly evaluate in diligence. The first is a retrieval-augmented generation (RAG) architecture that grounds outputs in the customer's proprietary brand documentation, style guides, and historical performance data, reducing hallucination risk and improving brand consistency. The second is a structured output validation layer that enforces format constraints and catches obvious quality failures before they reach human review. The third is confidence scoring and routing logic that escalates low-confidence outputs to human editors rather than silently producing bad content. The fourth is an evaluation pipeline: an automated test suite that runs on every model or prompt change to catch regressions.
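Two of these components, output validation and confidence routing, are simple enough to sketch in a few lines of Python. This is an illustration under stated assumptions, not a description of any company's system. It assumes the model returns a JSON object with "headline" and "body" fields and that the product has some confidence score between 0 and 1. The names (Draft, Decision, validate, route), the 60-character headline limit, the banned-term list, and the 0.8 threshold are all hypothetical.

```python
import json
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Draft:
    raw: str           # raw model output, assumed to be a JSON object
    confidence: float  # hypothetical score in [0, 1] from the product's own scorer


@dataclass
class Decision:
    route: str         # "publish", "human_review", or "reject"
    reasons: List[str]


def validate(raw: str, max_headline_chars: int = 60,
             banned: Sequence[str] = ("guaranteed",)) -> List[str]:
    """Return format and quality violations; an empty list means well-formed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(data, dict):
        return ["not a JSON object"]

    problems = []
    for field in ("headline", "body"):
        value = data.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    if problems:
        return problems

    if len(data["headline"]) > max_headline_chars:
        problems.append("headline too long")
    text = f"{data['headline']} {data['body']}".lower()
    for term in banned:
        if term in text:
            problems.append(f"banned term: {term}")
    return problems


def route(draft: Draft, threshold: float = 0.8) -> Decision:
    """Reject malformed drafts, escalate low-confidence ones, publish the rest."""
    problems = validate(draft.raw)
    if problems:
        # Obvious failures are caught here, before they reach a human editor.
        return Decision("reject", problems)
    if draft.confidence < threshold:
        reason = f"confidence {draft.confidence:.2f} below {threshold:.2f}"
        return Decision("human_review", [reason])
    return Decision("publish", [])
```

The ordering is the design choice worth noticing: validation runs first so that editors only see drafts that are at least well-formed, and the confidence check decides whether a well-formed draft still needs a human.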

The companies that have this scaffolding in place are fundamentally different products from the ones that don't. They're more defensible, more scalable, and more trustworthy in enterprise procurement. And they're much harder to build. The market is beginning to bifurcate between AI writing tools and AI content infrastructure — the former is a commodity feature, the latter is where durable businesses are being built.

The gap between "NLP researcher" and "production ML engineer" matters

One thing I've noticed doing technical diligence is that the specific background of the technical co-founder strongly predicts which failure modes a company is vulnerable to. This isn't a criticism of any background — it's a recognition that different training environments optimize for different things.

NLP researchers are trained to evaluate model quality rigorously and to think carefully about what a model is actually doing. They often underestimate the systems engineering challenges of production deployment — latency requirements, cost optimization, failure mode handling at scale, monitoring. Production ML engineers are trained to ship reliable systems, but may not have the principled model evaluation discipline that prevents quality drift over time.

The founders we find most interesting are the ones who span both worlds, or who have co-founders who fill the gap between them. A PhD who spent three years deploying models in a production environment before founding a company is rare. When we find one, the technical diligence conversations are meaningfully different: they already have opinions about eval frameworks, they've already been burned by fine-tuning misconceptions, they already know that prompt engineering is necessary but not sufficient.

We're not saying academic NLP background is a disqualifier, nor that production engineering background is sufficient on its own. The point is that the best technical founders in this space have synthesized both — and that synthesis is one of the strongest signals we look for.

What we'd do differently in Fund I if we were starting over

More rigorous eval framework assessment in diligence. We should have asked every technical founder in 2021: "How do you know your model is performing well? Walk me through your evaluation setup." That single question would have surfaced the evaluation blindness problem much earlier.

More attention to the production architecture, not just the demo. The demo environment and the production environment are very different things. We should have asked to see the system design document — how does the product behave when inputs are messy, when the model produces a low-confidence output, when a customer's brand voice is unusual? Founders with good answers to those questions had fundamentally different products from founders who hadn't thought about them.

Less weight on how impressive the demo outputs were. The best demo outputs often came from the most prompt-brittle products. We learned to be more suspicious of flawless demos and more interested in how the product handled adversarial or out-of-distribution inputs — including inputs we'd provide ourselves rather than those the founder had prepared.

The companies in the Fund I cohort that are performing best are the ones that built evaluation infrastructure early, treated fine-tuning as one tool among many rather than a silver bullet, and hired production-experienced ML engineers alongside their research-background founders. Those patterns weren't fully articulated in our Fund I thesis. They're central to Fund II.
