Lessons from 9 AI Product Teams: Emerging Themes from Just Now Possible
Stack Overflow built four versions of their AI-powered search, then killed it. Arize's team started building when "agent frameworks" didn't even exist. eSpark's former teachers are now writing eval code in Python.
Over the past few months, I've had the privilege of sitting down with product teams from several companies who are building AI products in production. These aren't theoretical conversations—these are teams shipping real AI features to real customers, dealing with all the messy reality that comes with it.
Through these conversations on my podcast Just Now Possible, I've noticed patterns emerging. Teams across different industries, building different types of AI products, keep bumping into similar challenges and discovering similar solutions. Here's what I'm seeing:
- Small, cross-functional teams are shipping fast.
- Nobody started as an AI expert.
- Domain expertise drives product decisions.
- Starting narrow beats starting broad.
- Architectures are getting more sophisticated—but in different ways.
- Evals evolve from simple to sophisticated.
- The trendy tool isn't always the right tool for the job.
- The "infinite integration problem" shows up everywhere.
- The challenge nobody expected: Knowing when to say "I don't know."
Whether these themes resonate or you simply want to know what's up ahead as you build your first AI products, my goal today is to show you how teams are gaining experience and overcoming challenges.
Small, Cross-Functional Teams Are Shipping Fast

Most of the teams that I talked to are surprisingly small.
At Arize, SallyAnn DeLucia (director of product with a data science background) and Jack Zhou (staff engineer who'd briefly been a PM) built Alyx, their AI agent, as a core team of two. They had advisors—their CEO Jason Lopatecki was involved daily, plus one of their best solutions architects—but the primary building was done by this product-engineering pair. What made them effective? Both had boundary-spanning experience. SallyAnn was technical enough to lean in heavily with engineers. Jack understood product thinking from his PM stint.
At Incident.io, Lawrence Jones (founding engineer and former SRE) partnered with Ed Dean (product lead for AI) to build their multi-agent investigation system. Ed had moved from a data role into product on their core Response product before taking on AI. This product-engineering partnership model meant they could make real-time decisions about what was technically possible and what would actually solve customer problems.
At eSpark, there were three members of the core team: Thom van der Doef (principal product designer), Mary Gurley (director of learning design and former teacher), and Ray Lyons (VP of product and engineering)—all with more than ten years at the company. What made them effective? Mary and Thom, despite their non-engineering backgrounds, started writing eval code in Python with LLM assistance. The learning design team—all former teachers—brought pedagogical expertise that shaped how they evaluated quality.
AI products are being built by small teams where everyone understands multiple disciplines and nobody is precious about staying in their lane.
Nobody Started as an AI Expert

Stack Overflow's story was particularly striking. When Ellen Brandenberger formed their AI team in early 2023 (right after ChatGPT launched), she told me: "We knew nothing." Ellen's background was in qualitative research and product management—not software engineering, not AI. No one on her team was an expert in LLM products.
Their approach? Weekly experimentation with "lunch and learns" on Fridays to share what they'd discovered. They time-boxed learning to avoid overwhelm. Ellen described it as "take one bite of the apple at a time." In three months, they ran around 50 experiments across five different pods, documenting everything.
The eSpark team had to learn RAG, embeddings, and vector databases from scratch—concepts none of them had encountered before. As Ray described it: "We did what everybody does and pulled up some podcasts and figured out what are some of the best practices." They used an LLM to learn Python.
Now, three years after ChatGPT launched, they've shipped four LLM-based features and can confidently ship daily curriculum updates using eval-based go/no-go decisions.
At Arize, SallyAnn and Jack started building in November 2023 when GPT-3.5 was, in Jack's words, "50% hallucinations, 50% something useful." Agent frameworks didn't exist yet. Most products didn't have an AI copilot. They had to invent their own testing approaches because the tooling they now build for customers didn't yet exist for them to use.
The learning curve is steep, but it's achievable. These teams are now shipping sophisticated AI products, and they all started from square one.
Domain Expertise Drives Product Decisions

The eSpark story illustrates how domain expertise shapes AI product decisions in ways that generalists would miss.
When they started building their teacher assistant with embeddings (a common AI search strategy), they ran into a problem. Queries that made perfect sense in an educational context—like "long A" for phonics—were too generic for embeddings. They matched too many irrelevant results.
The team realized they needed to add metadata to their educational content. This is where their pedagogical expertise became critical. Ray told me they built an asynchronous LLM pipeline that took their sparse existing data and generated rich metadata—but the schema for what to generate came from Mary and her instructional design team. They knew what mattered for teaching: learning objectives, keywords, topics (like "phonics"), subtopics (like "long E"), and domain classifications (to distinguish "algebraic thinking" from "geometry" concepts with similar names).
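To make that concrete, here's a rough sketch of what an enrichment pass like this could look like. The schema fields come from the eSpark conversation, but the prompt wording and the call_llm helper are my own placeholders, not their actual pipeline (which ran asynchronously over their whole library):

```python
# A minimal sketch of an LLM-driven metadata enrichment pass. The schema fields
# (learning objectives, keywords, topic, subtopic, domain) come from eSpark's
# description; `call_llm` is a hypothetical stand-in for whatever model API you use.
import json

METADATA_PROMPT = """You are an instructional-design assistant.
Given this learning activity, return JSON with exactly these keys:
  learning_objectives: list of short objectives
  keywords: list of search terms a teacher might use
  topic: e.g. "phonics"
  subtopic: e.g. "long E"
  domain: e.g. "algebraic thinking" vs. "geometry"

Activity title: {title}
Activity description: {description}
"""

def call_llm(prompt: str) -> str:
    """Placeholder for a real model call (OpenAI, Anthropic, etc.)."""
    raise NotImplementedError

def enrich_activity(activity: dict) -> dict:
    """Generate rich metadata for one piece of sparse curriculum content."""
    prompt = METADATA_PROMPT.format(
        title=activity["title"], description=activity.get("description", "")
    )
    metadata = json.loads(call_llm(prompt))
    # Store the metadata alongside the original record so search can filter on it.
    return {**activity, "metadata": metadata}
```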
The instructional design team's pedagogical expertise also shaped how they evaluated their AI. Mary told me: "As former teachers on the team, we asked, 'How did we use to score student work and student output?' We created rubrics." They already knew how to design rubrics to evaluate student output, so it was natural for them to design rubrics to evaluate LLM output.
At Zencity, domain expertise in local government operations shaped every product decision.
Andrew Therriault, their VP of data science, is a former local government official. That background meant he understood workflows like annual budget planning, five-year strategic planning, Compstat performance tracking, and crisis management.
One example showed why generic AI wouldn't work for local government. Imagine a city posts on social media: "We're fixing the pothole problem on Main Street." A resident comments: "This is a really big problem!"
Generic sentiment analysis would classify that comment as negative. But Andrew's team trained a custom sentiment model for local government context. That comment is actually positive sentiment toward the government—the resident is acknowledging the problem and trusting that the city is addressing it. Generic AI would miss that nuance entirely.
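Zencity trained a custom model for this, but the core idea is easy to illustrate. Here's a toy, prompt-based version (not their approach) that encodes the key framing: score sentiment toward the government's action, not toward the topic of the post. The call_llm helper is a placeholder:

```python
# A toy illustration of context-aware sentiment. Zencity trained a custom model;
# this sketch just shows how the local-government framing changes the label.
def call_llm(prompt: str) -> str:
    """Placeholder for a real model call."""
    raise NotImplementedError

def government_sentiment(post: str, comment: str) -> str:
    prompt = f"""A city government posted: "{post}"
A resident commented: "{comment}"

Classify the resident's sentiment TOWARD THE GOVERNMENT'S ACTION as
positive, neutral, or negative. A comment that acknowledges a problem
while trusting the city to address it counts as positive. Answer with one word."""
    return call_llm(prompt).strip().lower()

# government_sentiment("We're fixing the pothole problem on Main Street.",
#                      "This is a really big problem!")  # expected: "positive"
```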
At Incident.io, Lawrence's background as a former SRE shaped how their AI investigates incidents.
The inductive-deductive reasoning loop, the way findings lead to hypotheses which lead to questions—that's exactly how experienced SREs debug production issues. Lawrence didn't invent a new investigation pattern. He encoded the expert pattern into the AI system.
Teams may not start out as AI experts, but they are relying on their domain expertise to build high-quality products.
Starting Narrow Beats Starting Broad

The Xelix team's approach exemplified this pattern. They're building a help desk for accounts payable teams that handles 1,000+ emails per day. The temptation would be to build an AI that handles everything.
Instead, Talal A. spent days manually tagging emails to understand the breakdown. They did a 20% analysis of email types by frequency and discovered that invoice reminders were the most time-consuming and had the clearest automation value. So they built a dedicated pipeline just for that category.
They're building narrow verticals—one pipeline per category—rather than trying to make a single AI handle everything. Start with the highest-value use case, get it working reliably, then expand.
Nurture Boss, an AI assistant for apartment complexes, helps property managers handle everything from tour scheduling to maintenance requests. Founded by Jacob Carter (CEO and engineer) with a co-founder who had experience servicing residents, the small team of five to six people could have built an agent that tackled every use case at once.
Instead, they started with just tour scheduling. "Tour scheduling is a big thing," Hamel Husain, who consulted with the team, explained. "You shouldn't have to call a human to do tour scheduling." They tested this narrow workflow with design partners before expanding to general questions about properties, specials and pricing, and maintenance requests. The team started narrow with tour scheduling and expanded gradually as they added capabilities.
At Neople, they build AI coworkers for e-commerce customer support. Today, their product can suggest responses, send automated responses, and take actions like canceling orders or processing refunds—even actions beyond customer service in finance, operations, and marketing workflows.
But they didn't start there. They started with the simplest possible version—just suggested responses. Their AI would generate a support ticket response, and the customer service agent would see two buttons: "copy message" (to edit before sending) and "send directly" (automation).
Seyna Diop, their CPO, told me they monitored which button customers clicked. When customers consistently chose "send directly," that gave them the confidence to add full automation. They used their beta users' behavior to decide what to build next.
The progression was deliberate—suggestions, then automation, then actions within customer service, then actions beyond customer service. Each phase built on lessons from the previous one.
Arize took a similar approach. They had many potential skill ideas for Alyx, their AI agent. After eight months of development, they launched with just a handful of carefully chosen skills.
How did they choose? SallyAnn walked me through their prioritization framework: "What will the LLM be good at plus what provides immediate value?" They sorted customer issues by frequency and similarity, then focused on the highest pain points—prompt optimization (where users spent the most time), eval template creation (a major blocker), and AI search (where semantic understanding added value beyond simple queries).
While many of these teams have built complex AI products—each with a large footprint—they didn't start by trying to do everything at once. They found one small use case and grew from there.
Architectures Are Getting More Sophisticated—But in Different Ways

Arize's architectural journey shows how teams are evolving from simple to more complex systems.
They started with what Jack called an "on-rails" architecture where they guided the paths an LLM could take when responding. Their workflow allowed an LLM to select from four tools at the top level, then choose sub-skills underneath. The workflow didn't have a lot of flexibility beyond identifying the right tool.
Jack explained that this approach made GPT-3.5's limitations manageable: the structure simplified what the LLM had to do, which was critical when working with a model that was still very error-prone.
But now they're building their next-generation architecture, an agent that follows a plan-first process. It's similar to how Cursor works—the agent starts with an initial planning step before execution, has the ability to revisit and revise plans dynamically, and has the freedom to take tangents as needed. This allows the agent to take more varied paths based on what the end user needs.
SallyAnn walked me through an example workflow for prompt optimization. The agent will:
- Check if an eval exists
- If missing, ask the user what they're trying to achieve and create an eval
- Run the baseline eval (might show 30% hallucination rate)
- Examine the bad examples
- Rewrite the prompt
- Re-run the eval
- Reflective loop: Did it improve? If not, try a different strategy
The agent can take tangents—like writing an eval if one doesn't exist—rather than following a rigid path. It has more range of motion to handle questions outside the predefined structure.
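Here's a rough sketch of what that reflective loop might look like in code. Every helper is a hypothetical stand-in, not Arize's implementation; the point is the shape: baseline, examine failures, rewrite, re-check, and change strategy if nothing improved:

```python
# A rough sketch of the reflective prompt-optimization loop described above.
# All helpers are placeholders to wire up to your own eval runner and model.
def eval_exists(prompt: str) -> bool: ...
def create_eval_with_user(prompt: str) -> None: ...
def run_eval(prompt: str) -> float: ...          # returns a 0-1 quality score
def worst_examples(prompt: str) -> list[dict]: ...
def rewrite_prompt(prompt: str, failures: list[dict], strategy: str = "default") -> str: ...

def optimize_prompt(prompt: str, max_rounds: int = 3, target: float = 0.9) -> str:
    if not eval_exists(prompt):
        create_eval_with_user(prompt)       # tangent: build the missing eval first

    score = run_eval(prompt)                # baseline, e.g. 0.70 if 30% hallucinate
    for _ in range(max_rounds):
        if score >= target:
            break
        failures = worst_examples(prompt)              # examine the bad examples
        candidate = rewrite_prompt(prompt, failures)   # rewrite the prompt
        new_score = run_eval(candidate)                # re-run the eval
        if new_score <= score:
            # Reflective step: it didn't improve, so try a different strategy.
            candidate = rewrite_prompt(prompt, failures, strategy="alternative")
            new_score = run_eval(candidate)
        if new_score > score:
            prompt, score = candidate, new_score
    return prompt
```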
Trainline took a different path—they went fully agentic from day one, but grew the agent's capabilities over time.
Matt Farrelly, their head of ML engineering, told me: "We took a very bold approach." They launched with a single central orchestrator in October 2024, using a loop-based architecture where the agent decides when it's done. In a loop-based architecture, the agent (the central orchestrator) is given a task, takes an action, evaluates its progress toward achieving the task, and then runs through the loop again until it determines it has achieved the task. The actions the agent can take are determined by the tools the agent is given access to.
While their architecture was agentic from the start, they started with minimal tools. Their proof of concept had just a simple vector database, a terms and conditions tool, and a mock refund endpoint. Matt explained: "We wanted to check if it could choose when to call terms and conditions and when to do a more general information check."
The minimal viable product focused on frequently asked questions and ticket terms and conditions—that's it. But the agentic architecture meant they could grow their tools over time without rearchitecting. They've since expanded to tools that give the agent access to 700,000 pages of curated material, real-time train positioning, disruption feeds, and multiple action tools.
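For readers newer to agents, here's a minimal sketch of that loop-based shape: the orchestrator picks a tool, looks at the result, and decides for itself when it's done. The tool names mirror Trainline's proof of concept, but the helpers and prompt are my own placeholders:

```python
# A minimal sketch of a loop-based orchestrator. `call_llm_json` and the tool
# bodies are hypothetical stand-ins; only the overall loop structure matters.
def call_llm_json(prompt: str) -> dict: ...
def search_faq(query: str) -> str: ...
def get_terms_and_conditions(ticket_type: str) -> str: ...
def mock_refund(booking_id: str) -> str: ...

TOOLS = {
    "search_faq": search_faq,
    "terms_and_conditions": get_terms_and_conditions,
    "refund": mock_refund,
}

def run_agent(task: str, max_turns: int = 8) -> str:
    history: list[str] = []
    for _ in range(max_turns):
        decision = call_llm_json(
            f"Task: {task}\nSteps so far: {history}\n"
            f"Available tools: {list(TOOLS)}\n"
            "Reply as JSON with keys: done (bool), answer, tool, args."
        )
        if decision["done"]:               # the agent decides when it has finished
            return decision["answer"]
        result = TOOLS[decision["tool"]](**decision["args"])
        history.append(f"{decision['tool']} -> {result}")
    return "Escalating to a human agent."
```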
Incident.io's architecture is the most sophisticated I encountered. It's a multi-agent orchestration system with dozens of agents working in parallel.
Lawrence described it as mimicking how humans investigate incidents. The main reasoning agent acts like an incident lead, while specialized sub-agents work independently—like SREs on a team.
The system implements an inductive-deductive reasoning loop:
- Collect findings (concrete observations backed by evidence)
- Generate hypotheses (potential explanations with narrative chains)
- Create questions to test those hypotheses
- Have sub-agents investigate specific questions
- Continuously refine as new information arrives
They have an ambient agent that monitors Slack channels. When humans post new findings, the system incorporates them and re-evaluates hypotheses. The investigation takes multiple "turns" as understanding evolves.
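Here's a rough sketch of how that loop might be modeled as data. The classes and field names are my own illustration, not Incident.io's schema; in their system, LLM agents generate and refine these objects:

```python
# Findings -> hypotheses -> questions, one "turn" at a time. Purely illustrative.
from dataclasses import dataclass, field

@dataclass
class Finding:
    summary: str                     # concrete observation
    evidence: list[str]              # links to logs, dashboards, Slack messages

@dataclass
class Hypothesis:
    narrative: str                   # potential explanation, as a narrative chain
    supporting: list[Finding] = field(default_factory=list)
    confidence: float = 0.5

@dataclass
class Question:
    text: str                        # what a sub-agent should investigate next
    tests: Hypothesis                # the hypothesis it would strengthen or rule out

def one_turn(hypotheses: list[Hypothesis], new_findings: list[Finding]) -> list[Question]:
    """One pass of the loop: fold in new findings, then ask what would change our mind.
    In the real system an LLM decides which findings relate and phrases the questions."""
    questions = []
    for hyp in hypotheses:
        hyp.supporting.extend(new_findings)      # naive: attach everything
        questions.append(Question(
            text=f"What evidence would rule out or strengthen: {hyp.narrative}?",
            tests=hyp,
        ))
    return questions
```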
Lawrence highlighted that there's "little difference between us running all these agents concurrently and actually modeling the incident channel as just a collection of humans that are just another agent." The architecture mirrors human team collaboration.
As teams learn and take on more complex problems, their AI architectures grow in complexity as well.
Evals Evolve from Simple to Sophisticated

Every team started with simple evaluation approaches and evolved over time.
Stack Overflow evolved from spreadsheets to a full-blown benchmark.
Stack Overflow began with Google Sheets. Ellen described the setup: "We'd log user questions in one column, the LLM responses in another, and capture thumbs up/down ratings. We'd group results by Stack Overflow tags—Python, frontend, DevOps—to see where quality varied by domain."
Next, they tested around 300 questions with subject matter experts—10-15 people across different technical domains. They tracked accuracy, relevance, and completeness separately because they learned that a response could be technically accurate but not relevant to the user's specific context (an answer written for a different Python version than the one the user is running, for example).
But here's where it gets interesting: Those learnings informed their next product, which was successful. They pivoted to data licensing and built an industry-grade benchmark. Their benchmark uses a golden dataset of around 250 questions refreshed monthly, with multiple layers of protection against data leakage. They use questions that aren't in Stack Overflow's public corpus and verify that third-party models haven't seen the data.
Arize—an eval tooling company—also started in spreadsheets and iterated from there.
The evolution at Arize was particularly interesting. SallyAnn told me about their early approach. They used Google Docs to do manual comparisons of examples and outputs. This is a company that builds eval tools for others, and they started exactly where everyone else does—with the simplest thing that could possibly work.
They ended up evaluating at multiple levels. SallyAnn walked me through it: "We have QA correctness at the top level, but then each individual step has an evaluation. Did it call the right tool? Did it call the tool with the right arguments? Did it accomplish the task for that tool correctly? Did the data actually have something in it? And then we have trace—did it call all the right tools in the right order? And then session—is my user getting frustrated as it keeps going back with Alyx and Alyx keeps getting it wrong? Are we losing memory?"
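As a simplified illustration, a step-level check like "did it call the right tools, in the right order, with the right arguments?" can be mostly deterministic once you log traces. The trace format below is an assumption for the sake of the example:

```python
# Step-level eval checks over a logged trace: right tool, right arguments,
# right order, and did each step actually return data? Illustrative only.
def eval_tool_calls(trace: list[dict], expected: list[dict]) -> dict:
    results = {
        "right_tools_in_order": [s["tool"] for s in trace] == [e["tool"] for e in expected],
        "steps": [],
    }
    for step, exp in zip(trace, expected):
        results["steps"].append({
            "tool": step["tool"],
            "correct_tool": step["tool"] == exp["tool"],
            "correct_args": all(step["args"].get(k) == v for k, v in exp["args"].items()),
            "returned_data": bool(step.get("output")),   # did the data have anything in it?
        })
    return results

trace = [{"tool": "search_runs", "args": {"project": "demo"}, "output": ["run_42"]},
         {"tool": "get_run", "args": {"run_id": "run_42"}, "output": {"score": 0.7}}]
expected = [{"tool": "search_runs", "args": {"project": "demo"}},
            {"tool": "get_run", "args": {"run_id": "run_42"}}]
print(eval_tool_calls(trace, expected)["right_tools_in_order"])  # True
```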
SallyAnn told me about their process: "We got all stakeholders in a room—product, engineering, security—and identified what each cared about. Product cared about user experience. Engineering cared about task correctness. Security cared about jailbreak detection. Then we built evals for each of those concerns."
They also focus evals on decision points rather than trying to evaluate everything. SallyAnn described their audit process: "I'd go through traces with my iPad and note every place where Alyx made a decision or I had a question. Then we'd abstract those into evaluation criteria."
Trainline evolved from human evaluations to a sophisticated user-context simulator.
Trainline went through a similar journey from simple to sophisticated. At launch, they started with human evaluation and small-scale automated tests. Matt explained: "We built this product really quickly. So we first leaned on human evaluation and what I'd call small-scale automated tests—we had certain tests where we do some semantic checks to say, yeah, this response is quite similar to this."
They outsourced red-teaming to a third party for toxicity and safety checks, and ran internal red-teaming rounds as well. But they knew this wouldn't scale. As Matt put it: "Where do you get that data set from? Large-scale AI companies are hiring teams of labelers. For Trainline, that's not an option on a large scale. Customers could ask us 10 million different questions."
So they moved to LLM-as-judge. Matt had been researching it before they even built the assistant and was impressed by the alignment with human judgments that researchers were achieving. They adopted Braintrust and set up judges for their four key principles: groundedness, relevance, helpfulness, and consistency.
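To show the shape of one such judge, here's a generic groundedness check. This isn't Trainline's judge or Braintrust's API, just a hedged sketch with a placeholder call_llm:

```python
# A generic LLM-as-judge sketch for groundedness: is every claim in the answer
# supported by the retrieved context? `call_llm` is a placeholder.
def call_llm(prompt: str) -> str: ...

def judge_groundedness(question: str, answer: str, retrieved_context: str) -> dict:
    verdict = call_llm(f"""You are grading a travel assistant.

Question: {question}
Retrieved context: {retrieved_context}
Assistant's answer: {answer}

Is every claim in the answer supported by the retrieved context?
Reply on two lines:
PASS or FAIL
one-sentence reason""")
    first_line, _, reason = verdict.partition("\n")
    return {"pass": first_line.strip().upper() == "PASS", "reason": reason.strip()}
```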
But even that didn't solve everything. Trains are context-specific. How do you evaluate an agent's response when someone's train to Manchester is delayed by five minutes and they're anxious?
So they built something I hadn't heard of anywhere else—a custom "user context simulator." They buy real train tickets in their test environment, then sample anonymously from real user queries and create "lookalike queries" based on that ticket context while the real train is moving. One ticket generates 100-1,000 test questions.
Because they're using real trains moving in real time, they can test their agent against live scenarios offline. This solves the context-specific evaluation challenge that many agentic systems face.
Nurture Boss evolved from error analysis to code assertions to LLM-as-judge evals.
When Hamel started consulting with Nurture Boss, they were stuck. "They had built all this stuff, it was wired together nicely, and it did kind of work. But there were problems, and they didn't know how to get better."
They started simple by reviewing production traces and adding notes about errors. "You don't want to spend ten minutes on a trace," Hamel explained. "Try to spend like a minute. Document the biggest elephant in the room and move on." They used an LLM to categorize these notes, then created a pivot table in Excel to count error frequencies. "Counting is incredibly powerful," Hamel noted. "It turns out that is one of the most powerful tools when it comes to debugging LLMs."
From there, they got more sophisticated. For date errors in tour scheduling, they wrote code-based assertions—deterministic checks that a tour was scheduled for the correct date. For subjective problems like "Should the agent have handed off to a human?", they built an LLM-as-judge.
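A code-based assertion for the date problem can be as plain as the sketch below; no LLM required, because the answer is simply right or wrong. The field names are assumptions, not Nurture Boss's schema:

```python
# A deterministic check for the tour-date error class: compare the date the
# user asked for against the date the agent actually scheduled.
from datetime import date

def assert_tour_date(requested: date, scheduled: date) -> dict:
    passed = requested == scheduled
    return {
        "check": "tour_scheduled_on_requested_date",
        "pass": passed,
        "detail": None if passed else f"asked for {requested}, scheduled {scheduled}",
    }

print(assert_tour_date(date(2025, 3, 14), date(2025, 3, 15)))
# {'check': 'tour_scheduled_on_requested_date', 'pass': False,
#  'detail': 'asked for 2025-03-14, scheduled 2025-03-15'}
```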
But Hamel emphasized knowing when not to over-engineer: "Some issues are just bugs. Just fix it and move on. The art is knowing which problems need rigorous evals and which just need obvious fixes."
Everybody is learning what evals look like as AI product complexity grows. Most teams went through several iterations and are still learning what works.
The Trendy Tool Isn't Always the Right Tool for the Job

One of the most valuable lessons emerging from these conversations: The newest, trendiest technology isn't always the right choice. These teams are making thoughtful decisions about when to use LLMs, when to use embeddings, when to use traditional code, and when to combine them strategically.
Incident.io discovered this with their retrieval system—but in an unexpected direction.
They started with pgvector in PostgreSQL, embedding documents for semantic search. Lawrence, their founding engineer and former SRE, thought embeddings would be the right approach. But they ran into problems.
"Vectors are totally inscrutable," Lawrence told me. "When you get wrong results back, your question is not like 'Why did that vector not match?' It's just numbers in a database. When you need to debug why something went wrong, you're stuck."
They also found that vector similarity would trigger on irrelevant or fuzzy matches. The chunking and embedding quality issues required an additional re-ranking layer. And versioning vectors when document formats changed became a pain point.
So they walked it back. Their current approach uses frontier models to generate free-form text summaries or tags/keywords from documents, stores those in PostgreSQL with standard database indexing, then uses text similarity for initial filtering. It's deterministic and debuggable. Then they use LLM re-ranking for final selection of top results.
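In rough shape, the pipeline looks something like the sketch below: LLM-generated tags stored alongside each document, a cheap deterministic filter, then an LLM re-rank of the shortlist. In production the store would be a PostgreSQL table with ordinary indexes; the helpers here are illustrative placeholders, not Incident.io's code:

```python
# Tag-based retrieval with LLM re-ranking. Illustrative only; swap the dict for
# a real database table and `call_llm` for a real model call.
def call_llm(prompt: str) -> str: ...

def index_document(doc_id: str, text: str, store: dict) -> None:
    tags = call_llm(f"List 5-10 short keywords or tags for this document:\n{text}")
    store[doc_id] = {"text": text, "tags": tags.lower()}

def retrieve(query: str, store: dict, shortlist: int = 20) -> list[str]:
    # Stage 1: deterministic, debuggable filter on keyword overlap.
    terms = set(query.lower().split())
    candidates = sorted(
        store.items(),
        key=lambda item: -len(terms & set(item[1]["tags"].split())),
    )[:shortlist]
    # Stage 2: LLM re-rank of the shortlisted candidates.
    listing = "\n".join(f"{doc_id}: {meta['tags']}" for doc_id, meta in candidates)
    ranked = call_llm(f"Query: {query}\nCandidates:\n{listing}\n"
                      "Return the 5 most relevant doc ids, comma-separated.")
    return [doc_id.strip() for doc_id in ranked.split(",")]
```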
They moved away from the trendy embeddings approach toward a combination of LLMs and traditional database indexing because it was simpler to debug. That's more valuable than cutting-edge technology that's impossible to understand when it fails.
Neople took a different approach—matching the retrieval strategy to the data type.
Job Nijenhuis, their CTO, explained their agent has access to multiple search strategies—semantic search, keyword search, and hybrid search. The agent also understands different document types. FAQs work best with semantic search on just the question. Product feeds work best with keyword search. The agent selects the right tool based on what it's searching.
Even their embedding strategy varies by document type. For FAQs, they embed only the question (not the answer) to get better matching. For product feeds, they rely more on keyword search than embeddings.
Zencity shows another variation—separating reasoning from facts.
Andrew, their VP of data science, explained their breakthrough: "We realized we can't give numbers to the LLM that we couldn't verify with the required consistency. What we ended up doing is essentially telling the LLM, just give us essentially a placeholder, an anchor for the actual data that you want to insert. And then at the end of the process, we basically copy it in."
For example, when including anecdotes in their reports, they tried passing data to the LLM. Sometimes it was verbatim, sometimes slightly reworded, sometimes it made something up that sounded like what they'd given it. So they realized they needed to provide the exact quote and a link to the source deterministically—not AI-generated.
The LLM handles reasoning about what information matters. The system handles retrieving and inserting the facts. This separation means they can trust the data while still leveraging LLM capabilities.
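Here's a minimal sketch of that placeholder pattern: the LLM writes the prose with anchors in it, and the system swaps in verified values at the end. The {{anchor}} token format, example facts, and helpers are my own assumptions, not Zencity's implementation:

```python
# The LLM reasons and writes; the system copies in the exact, verified data.
import re

def call_llm(prompt: str) -> str: ...

VERIFIED_FACTS = {
    "pothole_complaints_q3": "412 complaints",
    "resident_quote_17": '"Thanks for finally fixing Main Street." (source: link)',
}

def draft_report(topic: str) -> str:
    return call_llm(
        f"Write a short report section about {topic}. Do NOT state numbers or "
        f"quotes yourself; insert placeholders like {{{{pothole_complaints_q3}}}} "
        f"wherever a data point belongs. Available anchors: {list(VERIFIED_FACTS)}"
    )

def fill_anchors(draft: str) -> str:
    """Deterministically substitute the verified values at the end of the process."""
    return re.sub(r"\{\{(\w+)\}\}",
                  lambda m: VERIFIED_FACTS.get(m.group(1), "[MISSING DATA]"),
                  draft)
```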
At Xelix, they built a sequential pipeline where each step is narrow and validated. Machine learning models handle vendor matching and invoice retrieval, producing probability scores at each step. Data gets enriched from multiple sources. An LLM synthesizes everything into a final response.
Claire, their AI engineer, explained they use confidence scoring that combines extraction quality, match probability, and overall context assessment. If confidence is too low at any step, they don't generate a response—they flag it for human review instead.
Traditional code handles structured data extraction and matching. The LLM handles natural language synthesis. Each tool does what it's best at.
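A simplified sketch of that kind of confidence gating is below. The thresholds and step names are illustrative assumptions, not Xelix's actual scoring:

```python
# Confidence gating across a sequential pipeline: every step reports a score,
# and any step below threshold routes the email to a human instead of auto-reply.
THRESHOLDS = {"extraction": 0.8, "vendor_match": 0.9, "context": 0.7}

def should_auto_respond(scores: dict) -> tuple[bool, str]:
    for step, minimum in THRESHOLDS.items():
        if scores.get(step, 0.0) < minimum:
            return False, f"low confidence at '{step}' ({scores.get(step, 0.0):.2f})"
    return True, "all steps above threshold"

ok, reason = should_auto_respond({"extraction": 0.95, "vendor_match": 0.62, "context": 0.9})
print(ok, reason)   # False low confidence at 'vendor_match' (0.62)
```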
The pattern: These teams aren't chasing the latest AI technique. They're choosing the right tool for each specific job—even if that means using "old-school" keyword search or deterministic code alongside cutting-edge LLMs.
The "Infinite Integration Problem" Shows Up Everywhere

AI products rarely work in isolation. They need data from multiple sources—internal systems, third-party tools, customer databases, real-time APIs. But every customer uses a different tech stack. There's no single integration that solves everyone's needs. Instead, there's a long list of systems to connect to, each with their own quirks, formats, and APIs (or lack thereof).
Teams are tackling this in different ways, depending on their product needs.
At Neople, this challenge was critical to their success. They build digital coworkers for customer support in e-commerce. And customer support, as Job, their CTO, explained, interfaces with all internal systems: "To give a good suggestion, you needed to integrate with ten different systems and every customer has ten of their own systems."
Their first solution was to partner with another startup that specializes in integrations. This got them access to common systems. But there's always a long tail of custom internal tools that don't have APIs.
Their second solution was to leverage agentic browser capabilities that can interact with systems that don't have APIs. An AI agent literally uses a browser like a human would, clicking through interfaces to execute actions. As Seyna described it, this allows customers to "work where they work from the get-go."
At Trainline, the integration challenge was different but equally complex. They needed data from multiple rail providers across the UK and EU, each with their own formats for schedules, disruptions, and terms and conditions.
Billie Bradley, their PM, explained they scaled their information retrieval dramatically—from 450 documents to 700,000 documents as they expanded what information their agent could access.
They use metadata tagging to manage information hierarchy. Some information is regulatory (required by law). Some is carrier-specific enhancements (this carrier offers more compensation than regulations require). The agent needs to understand which source takes precedence.
They also built a dedicated LLM judge in their information retrieval service. When the retrieval system finds the top ten most relevant pieces, the judge reviews them and assesses whether they're actually relevant to the query. Only the truly relevant context gets passed to the orchestrator. This prevents information overload while ensuring quality.
Incident.io had a different advantage. They'd already solved part of this problem for their core product.
Lawrence, their founding engineer, explained their position: "We arrived at trying to build these AI systems with already a huge established connection pool to almost any integration. We've got all your issue trackers. We even have a connection to your HR tools so that we can find holiday clashes with your paging provider."
Customers already connect almost all their tooling to Incident.io for incident response—GitHub, GitLab, Slack, Datadog, Grafana, and dozens more. They leverage that existing integration pool for the AI SRE.
They still face the long tail of different data sources across their customer base. But they prioritize new integrations based on customer pull and early-stage collaboration rather than trying to build everything upfront.
The pattern across all these teams: There's no "one integration to rule them all." You either partner with integration platforms, build prioritized subsets based on customer needs, or create abstraction layers (like browser automation) that can adapt to different systems.
The Challenge Nobody Expected: Knowing When to Say "I Don't Know"

LLMs are trained to be helpful. They want to give you an answer. But sometimes the right answer is "I don't know"—and getting an LLM to admit that turns out to be surprisingly hard.
This isn't about traditional hallucinations where models make up completely false information. Instead, it's about something more subtle—the tendency to stretch insufficient data into a confident-sounding answer when uncertainty would be more appropriate.
Multiple teams identified this as a critical challenge they had to solve. Lawrence at Incident.io told me, "Positivity bias is the most common 'aha moment' when onboarding new team members to AI work." LLMs try too hard to give answers even with insufficient data. If you tell an LLM to return three to five points, it will return three to five points, fabricating some if necessary.
Lawrence explained how they counter it: "We will dial down what we're claiming based on the level of confidence and uncertainty that we have. And this stuff is so important for a system like this—you can't build one without it."
Their solution is to build uncertainty and confidence management into the core architecture, not as an afterthought. The system critiques its own understanding, identifies what it's uncertain about, and generates questions. As Lawrence explained: "What questions that if only I had the answer to those questions, I would be able to discount, rule out, or strengthen my hypothesis?"
At Neople, this was even more fundamental. LLMs are trained on vast knowledge and want to use all of it. But Neople needed their agent to only use provided customer context and say "I don't know" when information wasn't available.
Job, their CTO, explained the challenge: "Large language models are really trained to have this gigantic brain full with previous knowledge. And that's really annoying to us. Like it makes it amazing and it makes it brilliant, but it makes it also really annoying."
For example, if a customer asks "Where is the Eiffel Tower?", Job wants the system to say "I don't know" unless that information is explicitly in the customer's knowledge base. It's important that the model only provides answers based on the company's specific context, not based on general knowledge. They had to fight against the model's instinct to be helpful by using everything it knows.
Their solution combined multiple approaches—agentic retrieval with confidence scoring, extensive evals before responses are sent (hallucination detection, refusal to answer detection, grounding verification), and a fail-safe where any eval failure sends the response as a suggestion to a human agent instead of automating.
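The fail-safe itself is conceptually simple, something like the sketch below: run every pre-send eval, and if any one fails, downgrade from automation to a suggestion. The individual eval functions are hypothetical stand-ins:

```python
# Any eval failure means a human sees the response as a suggestion instead of
# it being sent automatically. Eval bodies are placeholders.
def detects_hallucination(response: str, context: str) -> bool: ...
def should_have_refused(response: str, context: str) -> bool: ...
def is_grounded(response: str, context: str) -> bool: ...

def route_response(response: str, context: str) -> str:
    checks_pass = (
        not detects_hallucination(response, context)
        and not should_have_refused(response, context)
        and is_grounded(response, context)
    )
    return "send_directly" if checks_pass else "suggest_to_agent"
```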
But Zencity had my favorite example of getting this right.
Shota Papiashvili told me about testing their AI Assistant v2 with Boston's CIO. The CIO asked: "What's the favorite ice cream flavor for the Bostonians?"
The system had zero data about ice cream preferences. This was a test question—would the AI make something up?
The response: "I don't know. I have no such information."
As Shota put it: "That worked well for us." After all their work on guardrails, after extensive prompt engineering ("Don't guess, don't try to make up data. For every fact you're putting in the text, cite the exact data point"), the system correctly admitted ignorance.
That "I don't know" represented hours of work getting the system to behave correctly. It's not a sexy feature. But it's the foundation of trust.
What This Means for the Rest of Us

These teams aren't working with secret knowledge or massive resources. They're learning by doing, starting small, and iterating quickly based on real usage.
And they're refreshingly honest about failures and false starts.
Stack Overflow built four iterations of conversational search over several months. They progressed from a chat interface on keyword search, to semantic search, to combining semantic search with GPT-4 fallback, to full RAG with attribution. Despite all that work and learning, they couldn't get above 70% accuracy. Ellen and her team made the hard decision to roll it back entirely.
But those lessons directly informed their next product, which was successful—data licensing with industry-grade benchmarking. As Ellen recounted, one of her team members reflected: "I think we had to go through that because otherwise we would have never been able to get where we are." The failed product taught them what they needed to know.
Teams iterate rapidly and aren't afraid to pivot. Arize built a local web app for testing early in development. SallyAnn described it bluntly: "It was so bad. It never worked." So they pivoted to scrappy notebook-based testing that let them iterate rapidly without UI overhead. Sometimes the right solution is simpler than you think.
At eSpark, the team built an open text chatbot interface first, assuming teachers would interact like ChatGPT power users. It took four to five teacher interviews to realize the interface was wrong.
Ray explained the insight: "Teachers had this trepidation, like the empty text box... how do I ask it the right way?"
So they replaced the open text with a structured dropdown. Teachers select their lesson from the core curriculum, and relevant activities appear immediately. The AI generates three follow-up question suggestions for continued exploration. Fewer clicks, no guessing what to type, immediate value.
Ray summed it up: "We assumed users would behave like us—frequent ChatGPT users. But teachers are 'rule followers' who needed more structure."
The teams that are succeeding share a few key traits:
They start with real customer data and real problems. Xelix's PM tagged thousands of real emails. eSpark tested against actual student learning content and teacher workflows. Arize used real customer data from day one.
They build evaluation into the process from the start, even if it's just spreadsheets. Stack Overflow started with Google Sheets. Arize started with "a Google Doc with a bunch of examples" where SallyAnn and Jack would manually compare outputs. The sophistication came later, but measurement started immediately.
They're ruthlessly focused on specific use cases rather than trying to solve everything. Arize launched with just a handful of carefully chosen skills. Xelix focused on just invoice reminders rather than trying to handle all email types. Neople started with suggestions only, then gradually expanded.
They stay close to customers and iterate based on actual usage patterns. Neople monitored which buttons customers clicked to decide when to add automation. eSpark had to pivot on their UX after watching teachers use their product. Trainline discovered the real-time reassurance use case only after launch.
They recognize that domain expertise shapes what AI can realistically do. The former teachers at eSpark, Andrew's experience as a former local government official at Zencity, the former SRE at Incident.io—domain knowledge determined what they built and how they evaluated quality.
And perhaps most importantly: They're comfortable getting uncomfortable.
Trainline's David told me exactly that: "Evaluation remains really, really subjective. We're getting comfortable with getting uncomfortable." You can't ship AI products knowing exactly how they'll operate. The iteration step has moved earlier in the development cycle. Non-deterministic outputs require new approaches to quality assurance.
Billie, also from Trainline, added: "You can't rely on 'When I previously built this, this wasn't possible.' Three weeks later, suddenly it is possible. You have to constantly pivot and adapt."
But here's the encouraging part: Teams are figuring it out. In real products. With real customers. And they're sharing what they're learning along the way.
The path forward isn't about mastering AI before you start. It's about starting small, measuring what matters, staying close to customers, and being honest about what's working and what's not.
These teams are proving it can be done.
Want to hear these conversations in full? Check out Just Now Possible where I talk with product teams about how they're building AI products.